I’m happy to start the Open Data Science course. I expect to deepen my knowledge of what is possible with R. I found this course while browsing the data analysis course list of the University of Helsinki. My GitHub repository for the course can be found here: https://github.com/Santa-Neimane/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Sun Dec 12 16:12:31 2021"
In this part we looked at linear regression to analyse data and how to fit a multiple linear regression model. We also touched upon how to check the assumptions on which the model is based.
Further sections have been written assuming that the reader has an introductory course level understanding of writing and reading R code as well as statistical methods.
date()
## [1] "Sun Dec 12 16:12:31 2021"
In the following analysis I will examine a data set originating from the Approaches to Learning survey, conducted in an introductory statistics course in Finland. More information and metadata can be found here: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt
Data was set up by:
The original data and the data wrangling script can be found in the IODS-project/data/create_learning2014.R script, whilst information about how the values were calculated can be found here: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt
# Read in the prepared data
data <- read.delim("C:/Users/Localadmin_neimane/Desktop/Data Science/OpenDataScience/IODS-project/data/learning2014.txt")
#Set gender as a factor
data$gender <- as.factor(data$gender)
#Show summary of the data set
summary(data)
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The resulting data set consists of 166 observations of 7 variables: gender (110 females, 56 males), age (from 17 to 55 years, mean age 26 years), attitude towards statistics (Attitude), deep approach score (deep), surface approach score (surf), strategic approach score (stra) and exam points (Points).
#Call for the needed packages
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#Visually explore the dataset
#Create a more advanced plot matrix with ggpairs()
p <- ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 10)))
p
#Check normality of the distributions with the Shapiro-Wilk test
shapiro.test(data$Attitude)
##
## Shapiro-Wilk normality test
##
## data: data$Attitude
## W = 0.99026, p-value = 0.3145
shapiro.test(data$deep)
##
## Shapiro-Wilk normality test
##
## data: data$deep
## W = 0.9812, p-value = 0.02371
shapiro.test(data$stra)
##
## Shapiro-Wilk normality test
##
## data: data$stra
## W = 0.99068, p-value = 0.3504
shapiro.test(data$surf)
##
## Shapiro-Wilk normality test
##
## data: data$surf
## W = 0.99219, p-value = 0.5073
shapiro.test(data$Points)
##
## Shapiro-Wilk normality test
##
## data: data$Points
## W = 0.96898, p-value = 0.0008932
On the left side you can see the data distribution histograms: the majority of the participants were women, and most were between 20 and 30 years old. According to the Shapiro-Wilk tests and the histograms, the variables Attitude, stra and surf do not deviate significantly from a normal distribution (p > 0.05), whilst deep and Points do. We can observe a correlation between Attitude and the exam result Points (r = 0.44). Interestingly, the strongest correlation (r = -0.62) was observed between deep and surf, but only in the male subset of the data. A similar but weaker correlation (r = -0.37) was observed between surf and Attitude. Whilst some of the point graphs in the upper part of the matrix show outliers, for the first analysis we will not remove them since they are not very extreme values. This will be rechecked with model diagnostics.
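The per-group correlations reported by ggpairs can also be computed directly. A minimal sketch, using the built-in mtcars data (grouped by transmission type) as a stand-in, since the course data is read from a local path; the commented line shows what the equivalent call on the course data would look like.

```r
# Sketch: computing a correlation separately for each group, with mtcars
# as stand-in data. For the course data this would be e.g.
# sapply(split(data[, c("deep", "surf")], data$gender),
#        function(d) cor(d$deep, d$surf))
by_group <- split(mtcars[, c("wt", "mpg")], mtcars$am)
sapply(by_group, function(d) cor(d$wt, d$mpg))
```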
The task was to fit a regression model where exam points is the target (dependent) variable and there are three explanatory variables. The visual analysis of the correlations and the linear regressions below show that model4, which includes Age, Attitude and stra, fits best. The steps to check this were:
model1 <- lm(Points ~ gender + Age +Attitude + deep + stra + surf, data = data)
summary(model1)
##
## Call:
## lm(formula = Points ~ gender + Age + Attitude + deep + stra +
## surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4506 -3.2479 0.3879 3.5243 11.4820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.30246 5.25723 3.481 0.000644 ***
## genderM -0.05858 0.92985 -0.063 0.949845
## Age -0.09607 0.05396 -1.780 0.076929 .
## Attitude 0.34430 0.05978 5.760 4.24e-08 ***
## deep -1.04279 0.78438 -1.329 0.185606
## stra 0.95871 0.55150 1.738 0.084083 .
## surf -1.10891 0.84467 -1.313 0.191132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.265 on 159 degrees of freedom
## Multiple R-squared: 0.2312, Adjusted R-squared: 0.2021
## F-statistic: 7.967 on 6 and 159 DF, p-value: 1.599e-07
model2 <- lm(Points ~ Age +Attitude + deep + stra + surf, data = data)
summary(model2)
##
## Call:
## lm(formula = Points ~ Age + Attitude + deep + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4759 -3.2812 0.3804 3.5381 11.4897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.30059 5.24076 3.492 0.00062 ***
## Age -0.09650 0.05335 -1.809 0.07234 .
## Attitude 0.34320 0.05699 6.022 1.14e-08 ***
## deep -1.04374 0.78179 -1.335 0.18375
## stra 0.96551 0.53917 1.791 0.07523 .
## surf -1.10529 0.84009 -1.316 0.19016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.249 on 160 degrees of freedom
## Multiple R-squared: 0.2311, Adjusted R-squared: 0.2071
## F-statistic: 9.62 on 5 and 160 DF, p-value: 4.803e-08
Variables surf and deep are not significant, so they are removed one by one.
model3 <- lm(Points ~ Age + Attitude + deep + stra, data = data)
summary(model3)
##
## Call:
## lm(formula = Points ~ Age + Attitude + deep + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9941 -3.0839 0.5037 3.5519 11.3298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.24374 3.57079 3.709 0.000286 ***
## Age -0.08750 0.05303 -1.650 0.100866
## Attitude 0.35388 0.05654 6.259 3.37e-09 ***
## deep -0.73226 0.74677 -0.981 0.328273
## stra 1.05032 0.53652 1.958 0.051998 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.261 on 161 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2035
## F-statistic: 11.54 on 4 and 161 DF, p-value: 2.915e-08
Next, the variable deep is removed from the model.
model4 <- lm(Points ~ Age + Attitude + stra, data = data)
summary(model4)
##
## Call:
## lm(formula = Points ~ Age + Attitude + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## Age -0.08822 0.05302 -1.664 0.0981 .
## Attitude 0.34808 0.05622 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
The Residuals section of the output shows how the observed values differ from the values predicted by the model, with details for specific points of the distribution (minimum, first quartile and so on). The median difference between the observed and fitted values is 0.33 points. The intercept is the predicted value of the dependent variable (in this model 10.9) when all of the explanatory variables are 0. The regression equation that explains the obtained exam points is: Points = 10.9 - (0.09 x Age) + (0.35 x Attitude) + (1.00 x stra)
The three asterisks highlight the significance of Attitude, while the dots show that the Age and stra p-values are near the chosen significance level (0.05). The residual standard error and F-statistic show how well the model can predict the data, and the degrees of freedom reflect the amount of data available to estimate the parameters (166 observations minus the four estimated parameters, i.e. the intercept and three coefficients, giving 162). The multiple R-squared of the model shows the proportion of the variance in the dependent variable that can be explained by the explanatory variables; it ranges from 0 to 1. In this model the three explanatory variables together account for 22% of the variation in the obtained exam points. The result also implies that Attitude is associated more strongly with the exam points than Age and stra (strategic approach).
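The fitted equation can be applied to new observations with predict(). A sketch, with a model fitted on simulated data (the course data file is read from a local path above), so the coefficients and the hypothetical new student are illustrative only.

```r
# Sketch: using a fitted lm to predict exam points for a new student.
# Simulated data stands in for the course data; means loosely follow
# the summary() output above.
set.seed(1)
d <- data.frame(Age      = round(rnorm(166, 25, 5)),
                Attitude = round(rnorm(166, 31, 7)),
                stra     = rnorm(166, 3.1, 0.7))
d$Points <- 10.9 - 0.09 * d$Age + 0.35 * d$Attitude + 1.0 * d$stra + rnorm(166, 0, 5)
m <- lm(Points ~ Age + Attitude + stra, data = d)
# predict() applies the regression equation coefficient by coefficient
new_student <- data.frame(Age = 22, Attitude = 35, stra = 3.5)
predict(m, newdata = new_student)
```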
As the explanatory variable Age does not have a statistically significant relationship with the obtained exam points, it will be removed from the model.
model5 <- lm(Points ~ Attitude + stra, data = data)
summary(model5)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The stra variable still does not reach the significance threshold.
model6 <- lm(Points ~ Attitude, data = data)
summary(model6)
##
## Call:
## lm(formula = Points ~ Attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
#Determine the AIC value for the models to choose the best fitting one
AIC(model1)
## [1] 1031.446
AIC(model2)
## [1] 1029.45
AIC(model3)
## [1] 1029.236
AIC(model4)
## [1] 1028.225
AIC(model5)
## [1] 1029.038
AIC(model6)
## [1] 1029.988
Even though only Attitude is significant, the other factors are close to significance and slightly improve the model (model4 has the lowest AIC score), thus model4, which includes all three explanatory variables (Age + Attitude + stra), is chosen for deeper analysis.
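The backward elimination done by hand above can also be automated with step(), which drops terms to minimise AIC. A sketch with the built-in mtcars data as a stand-in for the course data set:

```r
# Sketch: automated backward elimination by AIC with step(),
# shown on mtcars as stand-in data.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
best <- step(full, direction = "backward", trace = 0)
formula(best)  # the retained explanatory variables
c(AIC(full), AIC(best))
```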
#Load needed package
library(car)
## Loading required package: carData
#Calculate Variance Inflation Factor of the chosen model
vif(model4)
## Age Attitude stra
## 1.010865 1.004083 1.014233
Multiple regression assumes that the explanatory variables are independent and do not correlate highly with each other. The VIF values are close to 1, so we do not need to worry about multicollinearity in this regression model.
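For reference, the VIF that vif() reports for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. A sketch computing it by hand, with mtcars as stand-in data:

```r
# Sketch: a VIF computed by hand: regress one predictor on the others
# and plug its R-squared into 1 / (1 - R^2). mtcars stands in for the
# course data.
r2 <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2)
vif_wt  # values near 1 indicate little multicollinearity
```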
For the built model to be considered valid, the multiple regression assumptions have to be met: the relationship between the dependent and explanatory variables ought to be linear, and the residual errors need to be normally distributed with constant variance. These assumptions can be checked with diagnostic plots and tests.
#Plot diagnostic plots (Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage) for the chosen model4.
par(mfrow = c(2,2))
plot(model4, which = c(1,2,5))
With the first graph, Residuals vs Fitted, we can check the linearity assumption. As the red line is roughly horizontal around 0 and the residuals show no pattern, we can assume a linear relationship. With the second graph, the Normal Q-Q plot, we can check the normality assumption for the residuals; since there are no large deviations from the line, we can move on. The third graph, Residuals vs Leverage, shows whether there are any influential values (outliers or high-leverage points). Whilst a couple of points have a rather large leverage value, the Cook’s distance lines do not even appear on the plot, meaning all points lie well inside them. Also, the spread of the standardized residuals does not seem to change with increasing leverage. Thus we may consider the developed model valid, but as we remember it did not have high explanatory power, so for further surveys other explanatory variables ought to be considered.
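The graphical checks can be complemented with numeric tests; for example, a Shapiro-Wilk test on the residuals checks the normality assumption (for the course model the call would be shapiro.test(residuals(model4))). A sketch with a model on mtcars as stand-in data:

```r
# Sketch: a numeric complement to the Q-Q plot: Shapiro-Wilk on the
# residuals of a fitted model. mtcars stands in for the course data.
m <- lm(mpg ~ wt + hp, data = mtcars)
shapiro.test(residuals(m))
```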
In this part we looked at logistic regression to analyse student alcohol consumption data.
date()
## [1] "Sun Dec 12 16:12:39 2021"
In the following analysis I will examine data originating from the data set Student achievement in secondary education of two Portuguese schools. More information and the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
The original data and the data wrangling script can be found in the IODS-project/data/create_alc.R script.
Data was set up by:
#Read in the prepared data
data <- read.csv("C:/Users/Localadmin_neimane/Desktop/Data Science/OpenDataScience/IODS-project/data/JoinedMathPorData.txt", sep="")
#Show names of the variables
names(data)
## [1] "school" "sex" "age" "address"
## [5] "famsize" "Pstatus" "Medu" "Fedu"
## [9] "Mjob" "Fjob" "reason" "guardian"
## [13] "traveltime" "studytime" "failures.math" "schoolsup"
## [17] "famsup" "paid.math" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel"
## [25] "freetime" "goout" "Dalc" "Walc"
## [29] "health" "absences.math" "G1.math" "G2.math"
## [33] "G3.math" "failures.por" "paid.por" "absences.por"
## [37] "G1.por" "G2.por" "G3.por" "alc_use"
## [41] "high_use"
The data set includes information such as sex and age (15 to 22), family size (fewer or more than 3 persons), parental education level, travel time to school, internet access and much more. In total there are 370 observations of 41 variables. A variable ending in .math or .por indicates that it was sourced from the Mathematics (.math) or Portuguese language (.por) course data.
For further analysis I will study the relationships between high/low alcohol consumption and the following variables:
These variables were chosen to test whether a lack of hobbies and close personal relationships results in increased alcohol consumption among the students. Lack of hobbies is represented by the variables activities and freetime, whilst the absence of close personal relationships can be determined through the variables romantic and famrel. At this point I expected that no activities combined with very high free time, no romantic relationship and very bad quality of family relationships (famrel = 1) would result in increased alcohol consumption.
In the following sections, the barplots on the left show the number (count) of answers, whilst those on the right show the proportions. The left graph is presented so the reader can see the data distribution among groups, and the right one so the reader can see the proportions and thus draw conclusions about the possible factors that affect alcohol consumption.
data$famrel <- factor(data$famrel)
data$freetime <- factor(data$freetime)
data$activities <- factor(data$activities)
data$romantic <- factor (data$romantic)
#Load the packages
library(ggplot2)
library(cowplot)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Plot the relationship of activities and alcohol consumption
GraphAct <- ggplot(data,
aes(x = activities)) +
geom_bar(position = "stack")+
ggtitle("Extra-curricular activities")
GraphActProp <- ggplot(data,
aes(x = activities,
fill = high_use)) +
geom_bar(position = "fill") +
labs(y = "Proportion")
plot_grid(GraphAct, GraphActProp)
In the Extra-curricular activities graphs we can firstly see that the numbers of students with and without activities are similar (~180 each). The proportion of high alcohol consumption is slightly higher among students with no extra-curricular activities.
# Plot the relationship of freetime and alcohol consumption
GraphFTime <- ggplot(data,
aes(x = freetime)) +
geom_bar(position = "stack")+
ggtitle("Free time after school")+
labs(x = "free time after school
(from 1 - very low to 5 - very high)")
GraphFTimeProp <- ggplot(data,
aes(x = freetime,
fill = high_use)) +
geom_bar(position = "fill") +
labs(y = "Proportion")+
labs(x = "free time after school
(from 1 - very low to 5 - very high)")
plot_grid(GraphFTime, GraphFTimeProp)
In the second graph, Free time after school, the division into free time groups is not equal: there are more students with an average amount of free time after school than with very high or very low amounts. The proportion graph supports the hypothesis that alcohol consumption increases with more available free time.
# Plot the relationship of romantic relationship and alcohol consumption
GraphRom <- ggplot(data,
aes(x = romantic)) +
geom_bar(position = "stack")+
ggtitle("with a romantic relationship")
GraphRomProp <- ggplot(data,
aes(x = romantic,
fill = high_use)) +
geom_bar(position = "fill") +
labs(y = "Proportion")
plot_grid(GraphRom, GraphRomProp)
In the third graph, with a romantic relationship, we can see that most of the students did not have a romantic relationship, whilst the proportion graph shows that students with a romantic relationship had only a slightly lower proportion of high alcohol consumption.
# Plot the relationship of family relationship and alcohol consumption
GraphFamR <- ggplot(data,
aes(x = famrel)) +
geom_bar(position = "stack")+
ggtitle("quality of family relationships")+
labs(x="quality of family relationships
(from 1 - very bad to 5 - excellent)")
GraphFamRProp <- ggplot(data,
aes(x = famrel,
fill = high_use)) +
geom_bar(position = "fill") +
labs(y = "Proportion")+
labs(x="quality of family relationships
(from 1 - very bad to 5 - excellent)")
plot_grid(GraphFamR, GraphFamRProp)
In the fourth graph, quality of family relationships, we can see that most of the students had good or excellent relationships with their family. In the proportions we can see that with increasing quality of family relationships there are fewer students who consume a high amount of alcohol, supporting the hypothesis that students with good family relationships consume less alcohol. As for the students with very bad relationships, the left graph shows that there are only a few of them, which may be why this group does not fit the general pattern.
In this section multiple models will be built. The best fitting one will be analysed in detail, its predictive power will be explored and the model will be cross-validated. First I will build a model with all of the selected variables.
#Seting the chosen variables as factors
data$famrel <- factor(data$famrel)
data$freetime <- factor(data$freetime)
data$activities <- factor(data$activities)
data$romantic <- factor (data$romantic)
#building model with glm() with all of the factors
model1 <- glm(high_use ~ famrel + romantic + freetime + activities, data = data, family=binomial)
#showing summary
summary(model1)
##
## Call:
## glm(formula = high_use ~ famrel + romantic + freetime + activities,
## family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4553 -0.8440 -0.6844 1.2657 2.1236
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4628 1.1240 -2.191 0.0284 *
## famrel2 0.7507 0.9719 0.772 0.4399
## famrel3 0.4070 0.8813 0.462 0.6442
## famrel4 -0.1789 0.8612 -0.208 0.8355
## famrel5 -0.5407 0.8827 -0.613 0.5402
## romanticno 0.1975 0.2571 0.768 0.4424
## freetime2 1.0093 0.8210 1.229 0.2189
## freetime3 1.2523 0.7856 1.594 0.1109
## freetime4 1.8474 0.7925 2.331 0.0197 *
## freetime5 2.0287 0.8423 2.409 0.0160 *
## activitiesno 0.3002 0.2372 1.266 0.2056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 428.30 on 359 degrees of freedom
## AIC: 450.3
##
## Number of Fisher Scoring iterations: 4
The after-school activities and romantic relationship factors are not significant, suggesting there is no association with the probability of high alcohol consumption.
#the model with glm() without non-significant romantic relationship factor
model2 <- glm(high_use ~ famrel + freetime + activities, data = data, family=binomial)
summary(model2)
##
## Call:
## glm(formula = high_use ~ famrel + freetime + activities, family = binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4314 -0.8199 -0.7056 1.2822 2.1381
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3755 1.1214 -2.118 0.0341 *
## famrel2 0.8276 0.9682 0.855 0.3927
## famrel3 0.4821 0.8773 0.550 0.5826
## famrel4 -0.1132 0.8580 -0.132 0.8950
## famrel5 -0.4541 0.8763 -0.518 0.6043
## freetime2 0.9759 0.8197 1.191 0.2338
## freetime3 1.2253 0.7847 1.562 0.1184
## freetime4 1.8174 0.7913 2.297 0.0216 *
## freetime5 1.9912 0.8406 2.369 0.0179 *
## activitiesno 0.3103 0.2366 1.311 0.1897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 428.89 on 360 degrees of freedom
## AIC: 448.89
##
## Number of Fisher Scoring iterations: 4
This model also does not suggest an association between activities and alcohol consumption.
#the model with glm() without non-significant activities factor
model3 <- glm(high_use ~ famrel + freetime, data = data, family=binomial)
summary(model3)
##
## Call:
## glm(formula = high_use ~ famrel + freetime, family = binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3729 -0.8366 -0.6810 1.3462 2.1895
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1844 1.1042 -1.978 0.0479 *
## famrel2 0.8418 0.9637 0.874 0.3824
## famrel3 0.4810 0.8726 0.551 0.5815
## famrel4 -0.1171 0.8532 -0.137 0.8908
## famrel5 -0.4769 0.8715 -0.547 0.5843
## freetime2 0.9580 0.8208 1.167 0.2431
## freetime3 1.1989 0.7864 1.525 0.1274
## freetime4 1.7913 0.7931 2.258 0.0239 *
## freetime5 1.9117 0.8400 2.276 0.0229 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 430.62 on 361 degrees of freedom
## AIC: 448.62
##
## Number of Fisher Scoring iterations: 4
# Building the simplest possible model
model4 <- glm(high_use ~ freetime, data = data, family=binomial)
summary(model4)
##
## Call:
## glm(formula = high_use ~ freetime, family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0383 -0.7815 -0.7815 1.3893 2.0688
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0149 0.7528 -2.677 0.00744 **
## freetime2 0.8253 0.8123 1.016 0.30961
## freetime3 0.9853 0.7750 1.271 0.20359
## freetime4 1.5294 0.7791 1.963 0.04965 *
## freetime5 1.6784 0.8252 2.034 0.04195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 441.17 on 365 degrees of freedom
## AIC: 451.17
##
## Number of Fisher Scoring iterations: 4
#Exploring how the models change when removing variables
AIC(model1,model2, model3, model4)
## df AIC
## model1 11 450.2989
## model2 10 448.8948
## model3 9 448.6203
## model4 5 451.1664
Model 3 has the lowest AIC score.
anova(model1, model2, model3, model4, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ famrel + romantic + freetime + activities
## Model 2: high_use ~ famrel + freetime + activities
## Model 3: high_use ~ famrel + freetime
## Model 4: high_use ~ freetime
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 359 428.30
## 2 360 428.89 -1 -0.5959 0.44013
## 3 361 430.62 -1 -1.7255 0.18899
## 4 365 441.17 -4 -10.5460 0.03217 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model4 is significantly worse than model3 according to the likelihood ratio test, so the famrel terms should be kept.
The visual data exploration coincided with the model outputs: the factors freetime (free time) and famrel (family relationships) affected the probability of high alcohol consumption by the students. According to the graphical exploration, the Akaike information criterion and the likelihood ratio test, the best model is model3, thus it will be used for further analysis.
In the original version of model3 the reference level for family relationships is famrel1 (very bad relationship). As we recall this was a rather problematic category because of the very small number of students, thus we change the reference level from 1 (very bad) to 3 (average). The same problem occurs with the variable freetime, so we change its reference to average free time as well.
#Set the average family relationship and average free time as the reference
data <- within(data, famrel <- relevel(famrel, ref = 3))
data <- within(data, freetime <- relevel(freetime, ref = 3))
#Re-run the model
model3 <- glm(high_use ~ famrel + freetime, data = data, family=binomial)
summary(model3)
##
## Call:
## glm(formula = high_use ~ famrel + freetime, family = binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3729 -0.8366 -0.6810 1.3462 2.1895
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5045 0.3056 -1.650 0.09884 .
## famrel1 -0.4810 0.8726 -0.551 0.58147
## famrel2 0.3608 0.5510 0.655 0.51258
## famrel4 -0.5982 0.3162 -1.892 0.05854 .
## famrel5 -0.9579 0.3620 -2.646 0.00815 **
## freetime1 -1.1989 0.7864 -1.525 0.12735
## freetime2 -0.2409 0.3651 -0.660 0.50941
## freetime4 0.5924 0.2783 2.129 0.03328 *
## freetime5 0.7128 0.3917 1.820 0.06881 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 430.62 on 361 degrees of freedom
## AIC: 448.62
##
## Number of Fisher Scoring iterations: 4
In the summary we can see that going from average family relationships (famrel3, the reference) to bad relationships (famrel2) increases the log odds of the student consuming a high amount of alcohol by 0.36, whilst going from average to excellent (famrel5) relationships decreases the log odds by 0.96. Freetime shows a similar pattern: going from average free time (the reference, freetime3) to very high free time (freetime5) increases the log odds of high alcohol consumption by 0.71, whereas going from average to very low free time (freetime1) decreases the log odds by 1.2. The difference between the null deviance (intercept only) and the residual deviance (with the variables included) indicates how much the explanatory variables improve the fit; here the drop from 452 to 431 suggests they carry some information.
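A log-odds change can be translated back to the odds and probability scales with exp() and plogis(). A sketch, plugging in the freetime5 coefficient and the intercept printed in the summary above:

```r
# Sketch: translating the freetime5 coefficient (0.7128) to the odds and
# probability scales, using the coefficients from the summary above.
exp(0.7128)               # odds multiplier for freetime5 vs freetime3
plogis(-0.5045)           # predicted probability at the reference levels
plogis(-0.5045 + 0.7128)  # predicted probability for freetime5, famrel3
```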
#calculate odds ratios (OR)
OR <- coef(model3) %>% exp
#calculate confidence intervals (CI)
CI <- confint(model3) %>% exp
## Waiting for profiling to be done...
#present the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.6038340 0.32804427 1.0930307
## famrel1 0.6181520 0.08398845 3.0504020
## famrel2 1.4344737 0.48258059 4.2790030
## famrel4 0.5498183 0.29574568 1.0258361
## famrel5 0.3837050 0.18707811 0.7770458
## freetime1 0.3015267 0.04552692 1.1623551
## freetime2 0.7859174 0.37459180 1.5802216
## freetime4 1.8082833 1.04899532 3.1300954
## freetime5 2.0396661 0.93587347 4.3848926
The table above shows the estimated odds ratios and 95% CIs for each variable. The results coincide with the hypothesis that a lack of close family relationships increases the odds of a student having high alcohol consumption: the odds for students with bad family relationships are increased (OR = 1.43) compared with students who have average family relationships. The opposite can be seen for students with good and excellent relationships, whose odds ratios are below 1. It was thus correctly hypothesized at the beginning of the analysis that students with good family relationships and fulfilled free time are less likely to consume a lot of alcohol.
Next we will explore the model’s predictive ability.
# calculate the probability of high_use
probabilities <- predict(model3, type = "response")
# add the predicted probabilities
data <- mutate(data, probability = probabilities)
# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = probability > 0.5)
ResultProbTable <- select(data, freetime, famrel, high_use, probability, prediction)
library(DT)
#Made a smaller, more interactive result table; otherwise it takes up a lot of space in the document. Feel free to scroll through it
datatable(ResultProbTable, options = list(pageLength = 5))
# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 245 14
## TRUE 98 13
Of the 259 FALSE values the developed model wrongly assigned 14 to TRUE, and it predicted most of the TRUE values (98 of 111) to be FALSE. The probability versus the actual alcohol use can be seen in the graph.
# Graph the results
ggplot(data, aes(x = probability, y = high_use))+
geom_point(position=position_jitter(h=0.000001,w=0.1), aes(colour = prediction))+
theme_bw()
#jitter has been added since the points closely overlap
# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66216216 0.03783784 0.70000000
## TRUE 0.26486486 0.03513514 0.30000000
## Sum 0.92702703 0.07297297 1.00000000
The table above shows how the model underestimates the proportion of TRUE values and overestimates the proportion of FALSE values: in the data set 0.30 of the observations are TRUE, whilst the model-predicted proportion is much smaller (0.07).
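The counts in the first cross-tabulation can be turned into the usual classification metrics. A sketch using the counts reported above:

```r
# Sketch: accuracy, sensitivity and specificity from the 2x2 confusion
# matrix reported above (rows = actual high_use, columns = prediction).
tab <- matrix(c(245, 98, 14, 13), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy    <- (tab["FALSE", "FALSE"] + tab["TRUE", "TRUE"]) / sum(tab)
sensitivity <- tab["TRUE", "TRUE"] / sum(tab["TRUE", ])    # recall of TRUE
specificity <- tab["FALSE", "FALSE"] / sum(tab["FALSE", ]) # recall of FALSE
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

The low sensitivity quantifies what the tables already suggested: the model rarely catches the actual high-consumption students.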
#define average prediction error
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
#calculate the average number of wrong predictions
loss_func(class = data$high_use, prob = data$probability)
## [1] 0.3027027
The total proportion of inaccurately classified individuals is 0.30, and as we saw before most of these are TRUE values that the model assigned to FALSE. Still, this model predicts better (~70% correct) than random guessing with 50% probability.
Here 10-fold cross-validation is used to determine the model’s predictive power by splitting the original data set and testing the model’s predictions against the part of the data (the held-out subset) that was not used to build the model.
# K-fold cross-validation
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
cv <- cv.glm(data = data, cost = loss_func, glmfit = model3, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3243243
Model3 has worse test set performance (a larger prediction error of 0.32 under 10-fold cross-validation) than the model introduced in DataCamp (which had an error of about 0.26). The performance of model3 might be improved by adding previously unexplored variables to the model.
In this task we will use the Boston data set from the MASS package. This data set describes housing values in suburbs of Boston. Full documentation can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html In short, the data set consists of several variables, such as nitrogen oxides concentration, per capita crime rate by town and pupil-teacher ratio by town. The variable abbreviations are: crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.
#Load the needed package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(GGally)
# Load the dataset
data("Boston")
# Explore the structure and the dimensions
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))
The data set consists of 506 rows and 14 variables. Per capita crime rate by town varies widely: the minimum is 0.006 and the maximum 88.98, with a mean of 3.61. The same holds for zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and black (the transformed proportion of blacks by town). chas, the Charles River dummy variable, is binary. nox (nitrogen oxides concentration, parts per 10 million) and dis (weighted mean of distances to five Boston employment centres) appear correlated, as do rm (average number of rooms per dwelling) with both lstat (lower status of the population, percent) and medv (median value of owner-occupied homes). medv and lstat also appear strongly correlated with each other.
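The correlations read off the ggpairs plot can be double-checked numerically; a small sketch using cor() on the same Boston data:

```r
# Numeric check of the pairwise correlations discussed above
library(MASS)
data("Boston")
round(cor(Boston[, c("nox", "dis", "rm", "lstat", "medv")]), 2)
cor(Boston$lstat, Boston$medv)  # strong negative correlation, about -0.74
```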
#Standardize the dataset
boston_scaled <- scale(Boston)
#Change from object
boston_scaled <- as.data.frame(boston_scaled)
#See the change
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
After scaling the data, the value ranges of the variables no longer differ so widely: every variable now has mean 0 and unit variance.
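For reference, scale() subtracts each column's mean and divides by its standard deviation; a minimal check written out by hand for one column:

```r
# scale() is equivalent to (x - mean(x)) / sd(x), column by column
library(MASS)
data("Boston")
manual  <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
machine <- as.numeric(scale(Boston)[, "crim"])
all.equal(machine, manual)  # TRUE
```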
Next we will create a categorical variable of the crime rate and use the quantiles as the break points.
#Create a quantile vector of crim rate
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#Add new variable
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
library(dplyr)
#Remove the old variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
Here I divide the dataset into a train set (with 80% of the data) and a test set.
n <- nrow(boston_scaled)
#Choose randomly 80% of data
ind <- sample(n, size = n * 0.8)
#Create train set
train <- boston_scaled[ind,]
#Create test set
test <- boston_scaled[-ind,]
Here I will fit the linear discriminant analysis on the train set and draw the LDA (bi)plot. Crime rate will be used as the categorical target variable and all the other variables in the dataset as predictor variables.
# Fit the analysis
lda.fit <- lda(crime ~ ., data = train)
# print the analysis result
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2301980 0.2599010 0.2549505
##
## Group means:
## zn indus chas nox rm age
## low 0.95434742 -0.9172420 -0.081207697 -0.8783694 0.42276968 -0.9208377
## med_low -0.04694271 -0.3467178 -0.060657012 -0.5872656 -0.14538574 -0.3633394
## med_high -0.38760189 0.2132033 0.140129052 0.3986156 -0.03237705 0.4186367
## high -0.48724019 1.0170891 -0.004759149 1.0735634 -0.43303206 0.8443271
## dis rad tax ptratio black lstat
## low 0.8976932 -0.6841618 -0.7430835 -0.42522185 0.38195183 -0.78216529
## med_low 0.3417202 -0.5323637 -0.4761438 -0.04104811 0.34094948 -0.16497785
## med_high -0.3519418 -0.4317008 -0.3070061 -0.17038162 0.07794698 0.08955366
## high -0.8651215 1.6384176 1.5142626 0.78111358 -0.76071286 0.89140494
## medv
## low 0.533537454
## med_low 0.014905360
## med_high 0.008859244
## high -0.702182881
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.155881532 0.52295496 -0.88793175
## indus -0.024346040 -0.29462963 0.24698588
## chas -0.054930861 -0.02907502 0.03882079
## nox 0.387868858 -0.76808734 -1.40327311
## rm -0.087842029 -0.09941379 -0.22763496
## age 0.289745515 -0.35672188 -0.11289738
## dis -0.119703397 -0.17944309 -0.01928404
## rad 3.171395340 1.03634080 -0.07263717
## tax 0.008791534 0.04102381 0.58137249
## ptratio 0.188350315 -0.08077894 -0.30812508
## black -0.129544789 0.03163498 0.16555584
## lstat 0.244863019 -0.17795630 0.33393499
## medv 0.176649989 -0.28129505 -0.11647492
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9530 0.0373 0.0096
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "blue4", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
Based on the biplot, the high class separates very clearly. Judging by the arrows, rad (index of accessibility to radial highways) contributes most to separating the high class from the others. The low, med_low and med_high classes overlap.
Now we will cross tabulate the results with the crime categories from the test set.
#Obtain the true classes for the variables of test dataset
correct_classes <- test$crime
#Remove the crime classes from test data set
test <- dplyr::select(test, -crime)
#Predict classes with the fitted linear discriminant analysis
lda.pred <- predict(lda.fit, newdata = test)
#Cross tabulate the results with the crime categories from the test set
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 4 2 0
## med_low 5 17 11 0
## med_high 1 2 17 1
## high 0 0 0 24
The fitted linear discriminant analysis correctly classified all high values in the test data set, and most of the true med_low observations were also assigned correctly. Performance was worst for med_high, which the analysis could not clearly separate from the med_low group. Note that the results may change depending on the random division of the data into train and test sets.
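From the cross-tabulation, the overall accuracy is the share of counts on the diagonal. The counts below are copied from this particular run and will vary with the random split:

```r
# Overall accuracy from the confusion matrix above (counts from this run)
conf <- matrix(c(18,  4,  2,  0,
                  5, 17, 11,  0,
                  1,  2, 17,  1,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)  # correct predictions / all predictions
round(accuracy, 2)  # about 0.75
```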
Next we will investigate the optimal number of clusters.
#Make a new scaled version of the dataset
library(MASS)
data("Boston")
boston_scaled2 <- scale(Boston)
#Calculate euclidean distance
dist_eu <- dist(boston_scaled2)
#Summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
km <-kmeans(boston_scaled2, centers = 2)
#Datapoints divided in 2 clusters
pairs(boston_scaled2, col = km$cluster)
km <-kmeans(boston_scaled2, centers = 3)
#Datapoints divided in 3 clusters
pairs(boston_scaled2, col = km$cluster)
km <-kmeans(boston_scaled2, centers = 4)
#Datapoints divided in 4 clusters
pairs(boston_scaled2, col = km$cluster)
km <-kmeans(boston_scaled2, centers = 5)
#Datapoints divided in 5 clusters
pairs(boston_scaled2, col = km$cluster)
As it is hard to visually determine the number of clusters, I will compare the sum of squared error (SSE) for a number of cluster solutions to choose the appropriate number of clusters.
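The same elbow curve that fviz_nbclust draws can be computed directly from kmeans, which stores the total within-cluster sum of squares in tot.withinss; a sketch:

```r
# Total within-cluster sum of squares for k = 1..10 clusters
library(MASS)
data("Boston")
boston_scaled2 <- scale(Boston)
set.seed(123)
wss <- sapply(1:10, function(k)
  kmeans(boston_scaled2, centers = k, nstart = 10)$tot.withinss)
# The "elbow" where the curve flattens suggests the number of clusters
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total WSS")
```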
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(boston_scaled2, kmeans, method = "wss")
The output confirms that more than two clusters are needed.
Let's determine the number of clusters according to the Bayesian Information Criterion for the expectation-maximization algorithm:
library(mclust)
## Package 'mclust' version 5.4.8
## Type 'citation("mclust")' for citing this R package in publications.
# Run the function to see how many clusters (from 1 to 20)
d_clust <- Mclust(as.matrix(boston_scaled2), G=1:20)
m.best <- dim(d_clust$z)[2]
cat("model-based optimal number of clusters:", m.best, "\n")
## model-based optimal number of clusters: 5
Thus we can conclude that the data divides well into 5 clusters.
The data set used in the following was built from United Nations Development Programme Human Development Report data, which focuses on evaluating human development (length of life and health, knowledge, standard of living) and how gender inequality affects this development (female reproductive health, labour market and parliamentary gender representation).
Original data sets can be found here: http://hdr.undp.org/en/content/human-development-index-hdi and here:http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf Data preparation can be found in scripts create_human.R and further edits in human_edited.R.
Short description of the data set variables:
SecEdRatioFM - ratio of females to males with secondary education in each country
LabForceRatioFM - ratio of female to male labour force participation in each country
ExpecYearsEd - expected years of education
LifeExpec - life expectancy at birth
GNI - gross national income per capita
MatMortRat - maternal mortality ratio
AdolecBirthRate - adolescent birth rate
RepresParl - percent representation in parliament
#Read in the data
data <- read.csv("C:/Users/Localadmin_neimane/Desktop/Data Science/OpenDataScience/IODS-project/data/humanEdit.txt", sep="")
#Load packages
library(ggplot2)
library(GGally)
#Plot
ggpairs(data, lower = list(combo = wrap("facethist", bins = 20)))
Many of the explored variables are correlated. The highest correlation coefficient (-0.857) is between maternal mortality ratio and life expectancy at birth: in countries with high maternal mortality, life expectancy is lower. Strong correlations can also be observed between life expectancy at birth & expected years of education, adolescent birth rate & life expectancy at birth, and maternal mortality ratio & expected years of education. Of the variables, expected years of education seems to follow a normal distribution, while GNI and AdolecBirthRate look log-normal, RepresParl resembles a Poisson distribution, and the rest perhaps follow beta distributions.
#Summaries of the variables in the data set
summary(data)
## SecEdRatioFM LabForceRatioFM ExpecYearsEd LifeExpec
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI MatMortRat AdolecBirthRate RepresParl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
All of the examined variables have a wide range of values, showing how different the examined countries are. On average, slightly more men than women have secondary education, in most countries fewer women participate in the labour force, and the expected length of education is about 13 years. The average life expectancy at birth is 72 years, GNI per capita is 17628, the maternal mortality ratio is 149, the adolescent birth rate is 47 and women hold 21% of parliamentary seats.
First of all we will look at the PCA when the data has not been scaled. Afterwards we will scale the data and run the analysis again.
#Run the analysis on non-scaled data
dataPCA <- prcomp(data)
#Call summary
s <- summary(dataPCA)
#Add percentages
pca_pr <- round(100*s$importance[2,], digits = 1)
#Make label names
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
#Plot the result
plotnoSTD <- biplot(dataPCA, cex = c(0.8, 1), col = c("grey40", "blue"), xlab = pc_lab[1], ylab = pc_lab[2],
sub="The non-standardized data can not be used for PCA (GNI sd affects PC1 unfairly)")
Let’s scale the data to standardize it and re-run the analysis. This time we will use the fviz_pca_biplot() function to make a more aesthetically pleasing graph.
#Scale data
data_STD <- scale(data)
library("factoextra")
library("FactoMineR")
#Run the analysis
data_STD.pca <- PCA(data_STD, graph = FALSE)
#Make plot
plotSTD <- fviz_pca_biplot(data_STD.pca, repel = TRUE,
col.var = "#2E9FDF",
col.ind = "#c3c3c3")
#Add caption
ggpubr::ggpar(plotSTD, caption = "Rich countries (high GNI) have better life and years of expectancy,
just as higher female to male secondary education ratio.
Whilst countries with high maternal mortality rate also have high adolescent birth rate.")
## Warning: ggrepel: 122 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The results differ between the non-standardized and standardized analyses. With non-standardized data, the first principal component explained virtually all of the variance. This is because PCA projects the data along directions of maximal variance, and in our data set GNI has a very large standard deviation, so it dominates the first component. To obtain a meaningful projection, the data must be standardized so that all variables are on the same scale.
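The effect can be demonstrated on simulated data, where one column with a huge standard deviation (standing in for GNI) dominates the unscaled PCA; a sketch with made-up data:

```r
# Why standardization matters for PCA: one high-variance column takes over PC1
set.seed(1)
sim <- data.frame(gni = rnorm(100, sd = 10000),  # huge spread, like GNI
                  a   = rnorm(100),
                  b   = rnorm(100))

# Proportion of variance explained by the first principal component
share_pc1 <- function(x) summary(prcomp(x))$importance[2, 1]

share_pc1(sim)         # close to 1: PC1 is essentially the "gni" axis
share_pc1(scale(sim))  # roughly 1/3 per component after standardization
```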
Almost 70% of the variance in the data is explained by the first two dimensions of the PCA. ExpecYearsEd, GNI, SecEdRatioFM and LifeExpec are highly correlated, as are MatMortRat and AdolecBirthRate; these variables are the main contributors to the first principal component. LabForceRatioFM and RepresParl are the most important contributors to the second component. Countries that lie close to each other in the biplot have similar values.
I will use the tea dataset from the package FactoMineR to perform multiple correspondence analysis.
#Load package
library(FactoMineR)
data("tea")
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
The data consists of 300 rows and 36 columns, describing how tea is drunk, how tea is perceived, and some personal details about the participants. All variables apart from age are factors, most of them binary; price has 6 levels.
Data visualized here:
library(mlbench)
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:mclust':
##
## map
## The following object is masked from 'package:car':
##
## some
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
#Read in data
data("tea")
#Automatically plot all of the data
marrangeGrob(
map(
names(tea),
~ ggplot(tea, aes_string(.x)) +
geom_bar()+
theme_bw()
),
ncol = 4,
nrow = 9,
top = "tea Distribution"
)
#Multiple Correspondence Analysis on the tea data
#Building the model with active variables describing how the tea is drunk, the age variable (column 19) as a quantitative supplementary variable, and the remaining variables describing personal details and product perception as qualitative supplementary variables.
teaMCA <- MCA (tea, quanti.sup = 19, quali.sup=c(20:36), graph = FALSE)
#Busy graphs are produced, so adjustments will be made and presented after looking at the summary of the analysis.
#First let's check the summary of the model with all categorical variables
summary(teaMCA, nbelements = Inf)
##
## Call:
## MCA(X = tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.148 0.122 0.090 0.078 0.074 0.071 0.068
## % of var. 9.885 8.103 6.001 5.204 4.917 4.759 4.522
## Cumulative % of var. 9.885 17.988 23.989 29.192 34.109 38.868 43.390
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.065 0.062 0.059 0.057 0.054 0.052 0.049
## % of var. 4.355 4.123 3.902 3.805 3.628 3.462 3.250
## Cumulative % of var. 47.745 51.867 55.769 59.574 63.202 66.664 69.914
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.048 0.047 0.046 0.040 0.038 0.037 0.036
## % of var. 3.221 3.127 3.037 2.683 2.541 2.438 2.378
## Cumulative % of var. 73.135 76.262 79.298 81.982 84.523 86.961 89.339
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27
## Variance 0.035 0.031 0.029 0.027 0.021 0.017
## % of var. 2.323 2.055 1.915 1.821 1.407 1.139
## Cumulative % of var. 91.662 93.717 95.633 97.454 98.861 100.000
##
## Individuals
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.541 0.658 0.143 | -0.149 0.061 0.011 | -0.306
## 2 | -0.361 0.293 0.133 | -0.078 0.017 0.006 | -0.633
## 3 | 0.073 0.012 0.003 | -0.169 0.079 0.018 | 0.246
## 4 | -0.572 0.735 0.235 | 0.018 0.001 0.000 | 0.203
## 5 | -0.253 0.144 0.079 | -0.118 0.038 0.017 | 0.006
## 6 | -0.684 1.053 0.231 | 0.032 0.003 0.001 | -0.018
## 7 | -0.111 0.027 0.022 | -0.182 0.090 0.059 | -0.207
## 8 | -0.210 0.099 0.043 | -0.068 0.013 0.004 | -0.421
## 9 | 0.118 0.031 0.012 | 0.229 0.144 0.044 | -0.538
## 10 | 0.258 0.150 0.045 | 0.478 0.627 0.156 | -0.482
## 11 | -0.177 0.070 0.015 | -0.291 0.233 0.042 | 0.219
## 12 | -0.204 0.093 0.062 | -0.225 0.139 0.076 | 0.177
## 13 | 0.009 0.000 0.000 | -0.101 0.028 0.006 | -0.579
## 14 | -0.392 0.345 0.164 | -0.206 0.117 0.045 | -0.324
## 15 | -0.184 0.076 0.027 | 0.289 0.229 0.066 | -0.685
## 16 | -0.032 0.002 0.001 | 0.533 0.780 0.219 | -0.152
## 17 | 0.419 0.394 0.139 | 0.278 0.212 0.061 | -0.374
## 18 | 0.329 0.243 0.120 | -0.044 0.005 0.002 | -0.291
## 19 | -0.066 0.010 0.003 | -0.239 0.157 0.043 | -0.359
## 20 | -0.194 0.085 0.035 | -0.055 0.008 0.003 | -0.347
## 21 | 0.151 0.052 0.021 | -0.014 0.001 0.000 | 0.020
## 22 | 0.613 0.845 0.120 | 0.107 0.032 0.004 | -0.870
## 23 | -0.301 0.203 0.034 | 0.116 0.037 0.005 | 0.257
## 24 | -0.121 0.033 0.014 | 0.152 0.064 0.023 | -0.363
## 25 | 0.094 0.020 0.007 | 0.228 0.143 0.040 | 0.228
## 26 | 0.641 0.924 0.140 | -0.371 0.377 0.047 | 0.075
## 27 | 0.077 0.013 0.007 | -0.178 0.087 0.041 | -0.079
## 28 | 0.508 0.579 0.202 | -0.356 0.348 0.099 | 0.015
## 29 | -0.156 0.055 0.035 | -0.024 0.002 0.001 | -0.288
## 30 | -0.073 0.012 0.006 | -0.137 0.051 0.022 | -0.204
## 31 | -0.270 0.163 0.042 | 1.026 2.885 0.609 | 0.201
## 32 | 0.492 0.545 0.183 | -0.136 0.051 0.014 | 0.019
## 33 | 0.156 0.055 0.026 | -0.067 0.012 0.005 | -0.022
## 34 | -0.454 0.463 0.146 | -0.024 0.002 0.000 | -0.464
## 35 | 0.714 1.145 0.260 | 0.330 0.298 0.055 | 0.105
## 36 | -0.398 0.355 0.187 | -0.240 0.159 0.068 | -0.210
## 37 | -0.218 0.107 0.044 | 0.035 0.003 0.001 | -0.294
## 38 | -0.670 1.008 0.406 | 0.030 0.002 0.001 | -0.314
## 39 | 0.363 0.297 0.091 | -0.219 0.131 0.033 | 0.018
## 40 | -0.047 0.005 0.002 | 0.164 0.074 0.022 | -0.165
## 41 | -0.329 0.243 0.110 | -0.300 0.247 0.092 | -0.082
## 42 | -0.068 0.010 0.002 | -0.063 0.011 0.002 | -0.355
## 43 | -0.119 0.032 0.016 | -0.320 0.281 0.115 | -0.138
## 44 | 0.715 1.150 0.174 | -0.034 0.003 0.000 | -0.814
## 45 | -0.214 0.103 0.070 | -0.118 0.038 0.021 | 0.179
## 46 | 0.565 0.716 0.215 | -0.432 0.512 0.126 | 0.052
## 47 | -0.471 0.498 0.084 | -0.123 0.042 0.006 | 0.477
## 48 | 0.172 0.066 0.034 | 0.057 0.009 0.004 | 0.115
## 49 | -0.127 0.037 0.011 | -0.182 0.091 0.022 | -0.373
## 50 | 0.621 0.867 0.271 | 0.337 0.311 0.080 | -0.553
## 51 | 0.381 0.327 0.151 | -0.118 0.038 0.015 | -0.003
## 52 | 0.301 0.204 0.069 | 0.337 0.311 0.087 | 0.064
## 53 | 0.185 0.077 0.019 | 0.842 1.946 0.383 | 0.238
## 54 | -0.414 0.384 0.106 | 0.266 0.194 0.044 | 0.234
## 55 | -0.591 0.786 0.243 | -0.050 0.007 0.002 | 0.144
## 56 | 0.481 0.521 0.208 | -0.075 0.016 0.005 | 0.185
## 57 | 0.110 0.027 0.009 | 0.020 0.001 0.000 | -0.238
## 58 | 0.214 0.103 0.043 | -0.144 0.057 0.020 | -0.019
## 59 | 0.487 0.533 0.111 | 0.164 0.073 0.013 | 0.113
## 60 | 0.324 0.236 0.073 | -0.226 0.140 0.035 | -0.285
## 61 | 0.877 1.730 0.498 | -0.204 0.115 0.027 | -0.196
## 62 | 0.236 0.125 0.028 | 0.427 0.501 0.091 | 0.313
## 63 | -0.388 0.338 0.227 | -0.114 0.035 0.020 | -0.245
## 64 | 0.396 0.352 0.115 | -0.102 0.029 0.008 | 0.023
## 65 | 0.031 0.002 0.001 | -0.174 0.083 0.036 | 0.360
## 66 | 0.772 1.341 0.150 | 0.501 0.689 0.063 | -0.436
## 67 | 0.860 1.662 0.533 | -0.081 0.018 0.005 | -0.072
## 68 | 0.587 0.774 0.180 | -0.163 0.073 0.014 | 0.575
## 69 | 0.334 0.251 0.070 | -0.322 0.284 0.065 | 0.080
## 70 | -0.350 0.276 0.144 | -0.208 0.119 0.051 | 0.118
## 71 | -0.086 0.017 0.011 | -0.260 0.185 0.101 | 0.163
## 72 | 0.517 0.600 0.188 | -0.234 0.150 0.039 | 0.084
## 73 | -0.048 0.005 0.002 | -0.308 0.260 0.077 | 0.403
## 74 | -0.059 0.008 0.002 | 0.942 2.434 0.427 | 0.597
## 75 | 0.598 0.805 0.135 | -0.454 0.566 0.078 | -0.060
## 76 | -0.542 0.660 0.336 | -0.271 0.202 0.084 | -0.196
## 77 | -0.195 0.085 0.055 | -0.235 0.151 0.080 | 0.008
## 78 | 0.057 0.007 0.003 | -0.283 0.220 0.069 | -0.281
## 79 | -0.089 0.018 0.003 | 0.158 0.068 0.008 | 0.696
## 80 | 0.212 0.101 0.045 | -0.260 0.185 0.068 | -0.098
## 81 | -0.408 0.374 0.161 | 0.342 0.322 0.114 | -0.211
## 82 | -0.017 0.001 0.000 | 0.123 0.041 0.004 | 0.523
## 83 | -0.095 0.020 0.007 | 0.115 0.036 0.010 | 0.062
## 84 | -0.045 0.005 0.002 | -0.203 0.113 0.048 | 0.030
## 85 | -0.374 0.315 0.076 | -0.246 0.167 0.033 | 0.000
## 86 | 0.067 0.010 0.002 | -0.489 0.656 0.101 | 0.096
## 87 | 0.917 1.889 0.435 | 0.080 0.017 0.003 | 0.291
## 88 | 0.193 0.084 0.018 | -0.537 0.790 0.137 | 0.312
## 89 | 0.137 0.042 0.013 | -0.242 0.160 0.040 | -0.205
## 90 | 0.154 0.054 0.011 | -0.291 0.232 0.039 | 0.408
## 91 | 0.090 0.018 0.007 | -0.211 0.122 0.037 | -0.111
## 92 | 0.017 0.001 0.000 | -0.289 0.230 0.027 | 0.590
## 93 | 0.167 0.063 0.021 | 0.051 0.007 0.002 | -0.334
## 94 | 0.150 0.050 0.007 | 0.494 0.669 0.073 | 0.606
## 95 | -0.028 0.002 0.000 | 0.902 2.231 0.499 | 0.409
## 96 | 0.004 0.000 0.000 | -0.163 0.073 0.021 | -0.335
## 97 | -0.542 0.660 0.336 | -0.271 0.202 0.084 | -0.196
## 98 | 0.151 0.052 0.028 | -0.342 0.322 0.146 | -0.159
## 99 | 0.715 1.150 0.296 | -0.298 0.244 0.051 | -0.245
## 100 | 0.193 0.083 0.018 | 0.961 2.534 0.455 | 0.104
## 101 | 0.376 0.318 0.093 | -0.308 0.260 0.062 | 0.486
## 102 | -0.507 0.577 0.204 | 0.083 0.019 0.006 | 0.137
## 103 | 0.520 0.608 0.195 | -0.191 0.101 0.026 | -0.108
## 104 | -0.201 0.090 0.047 | -0.329 0.297 0.128 | -0.034
## 105 | -0.174 0.068 0.010 | -0.255 0.179 0.021 | 0.297
## 106 | -0.111 0.028 0.015 | -0.319 0.280 0.128 | 0.295
## 107 | -0.111 0.028 0.015 | -0.319 0.280 0.128 | 0.295
## 108 | 0.593 0.791 0.238 | -0.282 0.219 0.054 | 0.482
## 109 | 0.022 0.001 0.001 | -0.156 0.067 0.030 | -0.035
## 110 | 0.250 0.141 0.041 | 0.301 0.248 0.060 | 0.379
## 111 | -0.291 0.190 0.082 | -0.148 0.060 0.021 | -0.008
## 112 | 0.172 0.066 0.034 | 0.057 0.009 0.004 | 0.115
## 113 | -0.164 0.061 0.029 | 0.222 0.135 0.053 | -0.042
## 114 | -0.106 0.025 0.007 | -0.336 0.309 0.069 | 0.411
## 115 | -0.348 0.272 0.174 | -0.256 0.180 0.094 | 0.191
## 116 | -0.332 0.248 0.058 | -0.204 0.114 0.022 | 0.200
## 117 | -0.303 0.206 0.083 | -0.167 0.077 0.025 | 0.461
## 118 | -0.229 0.118 0.021 | -0.294 0.238 0.035 | 0.477
## 119 | -0.187 0.078 0.036 | -0.135 0.050 0.019 | -0.096
## 120 | 0.244 0.134 0.056 | 0.296 0.240 0.083 | -0.129
## 121 | -0.231 0.120 0.082 | -0.176 0.085 0.047 | 0.106
## 122 | 0.685 1.056 0.263 | -0.056 0.009 0.002 | 0.319
## 123 | -0.330 0.244 0.142 | -0.195 0.104 0.050 | 0.487
## 124 | -0.400 0.359 0.232 | -0.181 0.090 0.048 | 0.360
## 125 | 0.051 0.006 0.002 | -0.423 0.490 0.153 | 0.254
## 126 | -0.234 0.124 0.042 | -0.301 0.249 0.069 | 0.112
## 127 | 0.773 1.345 0.205 | -0.093 0.024 0.003 | -0.761
## 128 | 0.245 0.135 0.034 | -0.057 0.009 0.002 | -0.542
## 129 | 0.031 0.002 0.001 | -0.337 0.311 0.126 | 0.154
## 130 | 0.187 0.079 0.027 | -0.416 0.474 0.136 | 0.110
## 131 | -0.014 0.000 0.000 | -0.290 0.230 0.099 | 0.087
## 132 | -0.205 0.094 0.036 | -0.189 0.098 0.031 | -0.312
## 133 | 0.559 0.704 0.194 | -0.136 0.050 0.011 | 0.558
## 134 | -0.461 0.478 0.110 | -0.006 0.000 0.000 | 0.015
## 135 | 0.459 0.473 0.068 | -0.100 0.027 0.003 | -0.357
## 136 | 0.482 0.523 0.140 | -0.198 0.108 0.024 | 0.114
## 137 | -0.201 0.090 0.022 | -0.251 0.173 0.034 | 0.424
## 138 | -0.014 0.000 0.000 | -0.217 0.129 0.043 | -0.015
## 139 | -0.364 0.298 0.195 | -0.125 0.043 0.023 | 0.190
## 140 | -0.364 0.298 0.195 | -0.125 0.043 0.023 | 0.190
## 141 | -0.311 0.217 0.127 | -0.288 0.228 0.109 | 0.013
## 142 | -0.199 0.089 0.015 | -0.352 0.340 0.045 | 0.805
## 143 | 0.028 0.002 0.001 | -0.264 0.191 0.091 | 0.278
## 144 | -0.280 0.176 0.093 | -0.183 0.092 0.040 | 0.007
## 145 | -0.291 0.191 0.088 | -0.314 0.270 0.102 | -0.182
## 146 | -0.400 0.359 0.232 | -0.181 0.090 0.048 | 0.360
## 147 | -0.060 0.008 0.004 | -0.239 0.156 0.068 | 0.259
## 148 | 0.003 0.000 0.000 | 0.356 0.349 0.043 | -0.372
## 149 | -0.166 0.062 0.010 | 0.193 0.102 0.014 | 0.352
## 150 | -0.268 0.161 0.048 | -0.224 0.138 0.033 | 0.389
## 151 | -0.105 0.025 0.004 | -0.010 0.000 0.000 | -0.576
## 152 | -0.450 0.456 0.312 | -0.120 0.040 0.022 | -0.146
## 153 | -0.471 0.500 0.065 | 0.347 0.331 0.035 | -0.534
## 154 | -0.520 0.608 0.263 | -0.012 0.000 0.000 | -0.189
## 155 | -0.443 0.441 0.224 | -0.213 0.124 0.052 | 0.264
## 156 | -0.450 0.455 0.064 | -0.061 0.010 0.001 | -0.646
## 157 | -0.274 0.168 0.051 | -0.301 0.248 0.061 | -0.061
## 158 | -0.400 0.359 0.232 | -0.181 0.090 0.048 | 0.360
## 159 | -0.121 0.033 0.010 | 0.356 0.348 0.087 | -0.514
## 160 | 0.278 0.174 0.062 | -0.111 0.034 0.010 | -0.076
## 161 | -0.165 0.061 0.017 | 0.185 0.094 0.021 | 0.161
## 162 | -0.461 0.478 0.236 | -0.052 0.007 0.003 | -0.356
## 163 | -0.413 0.384 0.149 | 0.127 0.044 0.014 | -0.247
## 164 | -0.275 0.170 0.087 | 0.059 0.010 0.004 | -0.416
## 165 | -0.460 0.477 0.126 | 0.772 1.637 0.356 | 0.020
## 166 | -0.450 0.456 0.312 | -0.120 0.040 0.022 | -0.146
## 167 | -0.588 0.778 0.324 | 0.054 0.008 0.003 | -0.229
## 168 | -0.042 0.004 0.001 | 1.138 3.550 0.522 | 0.142
## 169 | -0.399 0.358 0.217 | -0.133 0.048 0.024 | 0.224
## 170 | -0.090 0.018 0.009 | -0.243 0.161 0.065 | 0.060
## 171 | 0.274 0.169 0.045 | 0.438 0.526 0.114 | 0.095
## 172 | 0.481 0.521 0.224 | -0.052 0.008 0.003 | -0.049
## 173 | 0.238 0.127 0.040 | 0.194 0.103 0.026 | -0.372
## 174 | -0.124 0.035 0.015 | -0.310 0.264 0.091 | 0.264
## 175 | -0.203 0.093 0.047 | -0.247 0.168 0.069 | 0.068
## 176 | -0.367 0.303 0.140 | -0.223 0.136 0.051 | 0.158
## 177 | -0.304 0.208 0.108 | -0.259 0.185 0.078 | -0.053
## 178 | -0.045 0.005 0.003 | -0.021 0.001 0.001 | 0.158
## 179 | -0.093 0.019 0.008 | -0.304 0.253 0.082 | 0.104
## 180 | -0.442 0.440 0.146 | 0.094 0.024 0.007 | 0.261
## 181 | -0.278 0.174 0.033 | -0.283 0.219 0.034 | -0.338
## 182 | -0.287 0.185 0.046 | 1.071 3.147 0.638 | -0.347
## 183 | -0.408 0.374 0.117 | -0.075 0.016 0.004 | -0.636
## 184 | -0.398 0.355 0.187 | -0.240 0.159 0.068 | -0.210
## 185 | 0.149 0.050 0.021 | -0.160 0.070 0.024 | 0.166
## 186 | -0.507 0.578 0.224 | 0.121 0.040 0.013 | -0.032
## 187 | -0.705 1.118 0.447 | -0.026 0.002 0.001 | -0.144
## 188 | -0.624 0.875 0.362 | -0.002 0.000 0.000 | -0.059
## 189 | -0.367 0.302 0.075 | 0.056 0.009 0.002 | 0.193
## 190 | 0.067 0.010 0.002 | 1.149 3.619 0.484 | 0.298
## 191 | -0.188 0.080 0.017 | 0.430 0.508 0.089 | 0.302
## 192 | -0.108 0.026 0.011 | -0.205 0.115 0.039 | 0.410
## 193 | -0.575 0.744 0.111 | -0.117 0.037 0.005 | 0.123
## 194 | -0.281 0.177 0.031 | -0.142 0.056 0.008 | 0.743
## 195 | -0.426 0.407 0.087 | 1.139 3.557 0.624 | -0.178
## 196 | 0.146 0.048 0.017 | 0.224 0.138 0.040 | -0.245
## 197 | -0.486 0.531 0.061 | -0.008 0.000 0.000 | 0.932
## 198 | 0.036 0.003 0.001 | 0.064 0.011 0.002 | 0.337
## 199 | -0.376 0.317 0.074 | 0.744 1.519 0.290 | -0.145
## 200 | -0.705 1.118 0.447 | -0.026 0.002 0.001 | -0.144
## 201 | 0.196 0.086 0.019 | -0.016 0.001 0.000 | 0.150
## 202 | 0.167 0.062 0.015 | 1.114 3.401 0.651 | -0.179
## 203 | 0.066 0.010 0.002 | 0.384 0.404 0.078 | 0.352
## 204 | 0.042 0.004 0.001 | -0.092 0.023 0.002 | 0.031
## 205 | -0.669 1.005 0.255 | 0.175 0.084 0.018 | 0.089
## 206 | 0.032 0.002 0.001 | 0.491 0.661 0.163 | -0.415
## 207 | 0.686 1.057 0.272 | 0.192 0.101 0.021 | -0.197
## 208 | 0.006 0.000 0.000 | 1.059 3.074 0.488 | 0.469
## 209 | -0.302 0.205 0.063 | 0.362 0.360 0.091 | 0.328
## 210 | 0.096 0.021 0.011 | -0.343 0.323 0.135 | 0.086
## 211 | -0.276 0.171 0.031 | 1.110 3.380 0.500 | 0.054
## 212 | -0.464 0.484 0.077 | 0.830 1.889 0.245 | 0.056
## 213 | -0.363 0.296 0.152 | 0.030 0.002 0.001 | -0.198
## 214 | -0.136 0.042 0.021 | -0.105 0.030 0.013 | 0.010
## 215 | 0.213 0.102 0.025 | -0.342 0.321 0.064 | 0.061
## 216 | -0.221 0.110 0.043 | -0.269 0.199 0.064 | -0.171
## 217 | -0.211 0.100 0.029 | 0.212 0.123 0.029 | -0.112
## 218 | -0.228 0.117 0.060 | -0.157 0.068 0.028 | 0.284
## 219 | -0.542 0.660 0.336 | -0.271 0.202 0.084 | -0.196
## 220 | 0.400 0.359 0.062 | 0.966 2.561 0.360 | 0.160
## 221 | 0.519 0.605 0.150 | -0.339 0.316 0.064 | 0.252
## 222 | -0.297 0.198 0.039 | 0.707 1.370 0.219 | 0.155
## 223 | -0.567 0.722 0.239 | -0.055 0.008 0.002 | 0.342
## 224 | 0.467 0.491 0.145 | 0.292 0.234 0.057 | 0.072
## 225 | 0.441 0.437 0.131 | 0.084 0.019 0.005 | -0.621
## 226 | -0.358 0.287 0.119 | 0.080 0.018 0.006 | 0.093
## 227 | -0.485 0.529 0.094 | 1.037 2.949 0.427 | 0.021
## 228 | -0.336 0.254 0.157 | -0.111 0.034 0.017 | -0.104
## 229 | 0.157 0.056 0.010 | 0.972 2.592 0.371 | 0.377
## 230 | -0.275 0.170 0.101 | -0.232 0.148 0.072 | -0.156
## 231 | -0.356 0.286 0.140 | 0.035 0.003 0.001 | -0.501
## 232 | -0.408 0.375 0.131 | -0.182 0.091 0.026 | -0.383
## 233 | 0.309 0.215 0.103 | -0.091 0.023 0.009 | -0.162
## 234 | 0.636 0.908 0.317 | -0.115 0.036 0.010 | -0.416
## 235 | 0.284 0.181 0.076 | -0.035 0.003 0.001 | 0.145
## 236 | 0.227 0.116 0.047 | -0.356 0.348 0.117 | 0.004
## 237 | -0.538 0.651 0.168 | -0.208 0.119 0.025 | -0.061
## 238 | 0.165 0.061 0.021 | 0.197 0.107 0.030 | 0.164
## 239 | 0.456 0.468 0.117 | -0.289 0.229 0.047 | 0.041
## 240 | 0.537 0.648 0.185 | 0.247 0.167 0.039 | -0.053
## 241 | 0.144 0.047 0.016 | 0.207 0.118 0.034 | 0.217
## 242 | -0.048 0.005 0.002 | -0.257 0.181 0.063 | -0.452
## 243 | 0.141 0.045 0.011 | 0.008 0.000 0.000 | 0.135
## 244 | 0.478 0.515 0.177 | -0.065 0.012 0.003 | -0.141
## 245 | -0.184 0.076 0.048 | -0.342 0.320 0.166 | 0.006
## 246 | 0.413 0.383 0.162 | -0.174 0.083 0.029 | -0.374
## 247 | -0.014 0.000 0.000 | -0.183 0.092 0.034 | -0.165
## 248 | 0.443 0.440 0.088 | 0.752 1.552 0.254 | 0.216
## 249 | -0.210 0.099 0.021 | 0.729 1.459 0.255 | 0.015
## 250 | -0.228 0.117 0.061 | -0.258 0.183 0.079 | -0.162
## 251 | 0.671 1.013 0.359 | -0.210 0.121 0.035 | 0.267
## 252 | 0.739 1.229 0.339 | -0.040 0.004 0.001 | 0.277
## 253 | 0.308 0.213 0.090 | 0.025 0.002 0.001 | 0.210
## 254 | -0.276 0.171 0.105 | -0.259 0.184 0.093 | -0.077
## 255 | 0.139 0.043 0.013 | 0.084 0.019 0.005 | 0.325
## 256 | -0.286 0.184 0.081 | -0.164 0.074 0.027 | -0.367
## 257 | -0.333 0.250 0.121 | -0.194 0.103 0.041 | -0.371
## 258 | -0.341 0.262 0.055 | -0.197 0.107 0.018 | -0.070
## 259 | 0.864 1.680 0.537 | -0.192 0.101 0.026 | 0.159
## 260 | 0.028 0.002 0.001 | -0.164 0.074 0.042 | -0.055
## 261 | -0.016 0.001 0.000 | 0.592 0.961 0.180 | 0.249
## 262 | -0.681 1.043 0.393 | 0.010 0.000 0.000 | 0.155
## 263 | -0.352 0.279 0.057 | -0.190 0.099 0.016 | -0.698
## 264 | 0.534 0.641 0.218 | -0.192 0.101 0.028 | 0.427
## 265 | 0.990 2.204 0.624 | -0.117 0.038 0.009 | 0.256
## 266 | -0.009 0.000 0.000 | 0.139 0.053 0.016 | 0.384
## 267 | 0.751 1.268 0.311 | -0.237 0.154 0.031 | -0.418
## 268 | 0.867 1.689 0.419 | 0.038 0.004 0.001 | 0.088
## 269 | 0.423 0.402 0.136 | 0.199 0.109 0.030 | -0.140
## 270 | 0.797 1.427 0.394 | -0.202 0.112 0.025 | -0.361
## 271 | 0.334 0.250 0.079 | 0.420 0.484 0.126 | -0.140
## 272 | 0.099 0.022 0.006 | 0.701 1.346 0.306 | -0.172
## 273 | 1.230 3.402 0.451 | -0.233 0.148 0.016 | -0.035
## 274 | 0.375 0.317 0.053 | -0.366 0.368 0.050 | -0.213
## 275 | -0.237 0.127 0.078 | -0.049 0.006 0.003 | -0.373
## 276 | -0.011 0.000 0.000 | -0.273 0.205 0.079 | 0.327
## 277 | 0.404 0.367 0.110 | 0.406 0.452 0.111 | -0.013
## 278 | 0.837 1.577 0.366 | 0.269 0.198 0.038 | -0.041
## 279 | 0.376 0.318 0.104 | -0.097 0.026 0.007 | -0.194
## 280 | 0.266 0.160 0.068 | -0.094 0.024 0.008 | 0.142
## 281 | 0.135 0.041 0.018 | -0.258 0.183 0.065 | 0.110
## 282 | 0.124 0.035 0.017 | -0.274 0.206 0.082 | 0.309
## 283 | 0.774 1.347 0.451 | -0.007 0.000 0.000 | 0.036
## 284 | 0.006 0.000 0.000 | -0.115 0.036 0.011 | 0.009
## 285 | -0.418 0.393 0.282 | -0.242 0.161 0.095 | 0.064
## 286 | 0.338 0.257 0.116 | -0.224 0.138 0.051 | -0.103
## 287 | 0.441 0.438 0.094 | 0.518 0.737 0.130 | 0.164
## 288 | 0.012 0.000 0.000 | 0.361 0.358 0.100 | -0.213
## 289 | 0.136 0.041 0.005 | -0.386 0.409 0.041 | -0.015
## 290 | -0.082 0.015 0.003 | -0.463 0.588 0.085 | -0.217
## 291 | -0.067 0.010 0.006 | -0.006 0.000 0.000 | -0.054
## 292 | -0.201 0.091 0.012 | -0.171 0.080 0.009 | 0.262
## 293 | -0.281 0.177 0.071 | -0.386 0.408 0.134 | -0.006
## 294 | 0.041 0.004 0.001 | -0.494 0.668 0.106 | 0.193
## 295 | 0.775 1.350 0.386 | -0.039 0.004 0.001 | 0.149
## 296 | -0.269 0.162 0.063 | -0.027 0.002 0.001 | -0.254
## 297 | 0.323 0.235 0.047 | 0.766 1.610 0.262 | -0.146
## 298 | -0.100 0.022 0.012 | -0.238 0.155 0.071 | 0.365
## 299 | 0.024 0.001 0.000 | 0.633 1.099 0.218 | 0.021
## 300 | -0.545 0.668 0.182 | -0.175 0.084 0.019 | -0.514
## ctr cos2
## 1 0.347 0.046 |
## 2 1.483 0.409 |
## 3 0.224 0.038 |
## 4 0.153 0.030 |
## 5 0.000 0.000 |
## 6 0.001 0.000 |
## 7 0.159 0.077 |
## 8 0.655 0.174 |
## 9 1.070 0.244 |
## 10 0.861 0.158 |
## 11 0.178 0.024 |
## 12 0.116 0.047 |
## 13 1.243 0.202 |
## 14 0.388 0.112 |
## 15 1.738 0.369 |
## 16 0.086 0.018 |
## 17 0.517 0.111 |
## 18 0.315 0.094 |
## 19 0.477 0.097 |
## 20 0.446 0.112 |
## 21 0.002 0.000 |
## 22 2.801 0.242 |
## 23 0.245 0.025 |
## 24 0.489 0.129 |
## 25 0.193 0.040 |
## 26 0.021 0.002 |
## 27 0.023 0.008 |
## 28 0.001 0.000 |
## 29 0.306 0.120 |
## 30 0.154 0.048 |
## 31 0.150 0.023 |
## 32 0.001 0.000 |
## 33 0.002 0.001 |
## 34 0.797 0.153 |
## 35 0.041 0.006 |
## 36 0.164 0.052 |
## 37 0.320 0.080 |
## 38 0.365 0.089 |
## 39 0.001 0.000 |
## 40 0.101 0.023 |
## 41 0.025 0.007 |
## 42 0.466 0.064 |
## 43 0.070 0.021 |
## 44 2.452 0.225 |
## 45 0.119 0.049 |
## 46 0.010 0.002 |
## 47 0.842 0.086 |
## 48 0.049 0.015 |
## 49 0.514 0.095 |
## 50 1.131 0.215 |
## 51 0.000 0.000 |
## 52 0.015 0.003 |
## 53 0.210 0.031 |
## 54 0.203 0.034 |
## 55 0.077 0.014 |
## 56 0.127 0.031 |
## 57 0.209 0.041 |
## 58 0.001 0.000 |
## 59 0.047 0.006 |
## 60 0.300 0.056 |
## 61 0.143 0.025 |
## 62 0.362 0.049 |
## 63 0.222 0.091 |
## 64 0.002 0.000 |
## 65 0.479 0.152 |
## 66 0.704 0.048 |
## 67 0.019 0.004 |
## 68 1.225 0.173 |
## 69 0.024 0.004 |
## 70 0.052 0.016 |
## 71 0.099 0.040 |
## 72 0.026 0.005 |
## 73 0.601 0.132 |
## 74 1.322 0.172 |
## 75 0.013 0.001 |
## 76 0.143 0.044 |
## 77 0.000 0.000 |
## 78 0.292 0.068 |
## 79 1.794 0.155 |
## 80 0.035 0.010 |
## 81 0.165 0.043 |
## 82 1.011 0.071 |
## 83 0.014 0.003 |
## 84 0.003 0.001 |
## 85 0.000 0.000 |
## 86 0.034 0.004 |
## 87 0.313 0.044 |
## 88 0.361 0.046 |
## 89 0.155 0.028 |
## 90 0.618 0.077 |
## 91 0.046 0.010 |
## 92 1.288 0.112 |
## 93 0.414 0.084 |
## 94 1.361 0.110 |
## 95 0.620 0.103 |
## 96 0.415 0.088 |
## 97 0.143 0.044 |
## 98 0.094 0.032 |
## 99 0.223 0.035 |
## 100 0.040 0.005 |
## 101 0.874 0.155 |
## 102 0.070 0.015 |
## 103 0.043 0.008 |
## 104 0.004 0.001 |
## 105 0.327 0.028 |
## 106 0.322 0.109 |
## 107 0.322 0.109 |
## 108 0.862 0.158 |
## 109 0.005 0.001 |
## 110 0.532 0.095 |
## 111 0.000 0.000 |
## 112 0.049 0.015 |
## 113 0.007 0.002 |
## 114 0.624 0.103 |
## 115 0.135 0.052 |
## 116 0.149 0.021 |
## 117 0.788 0.192 |
## 118 0.841 0.091 |
## 119 0.034 0.010 |
## 120 0.062 0.016 |
## 121 0.042 0.017 |
## 122 0.376 0.057 |
## 123 0.879 0.310 |
## 124 0.480 0.188 |
## 125 0.239 0.055 |
## 126 0.046 0.010 |
## 127 2.146 0.199 |
## 128 1.088 0.168 |
## 129 0.087 0.026 |
## 130 0.045 0.009 |
## 131 0.028 0.009 |
## 132 0.359 0.084 |
## 133 1.154 0.193 |
## 134 0.001 0.000 |
## 135 0.471 0.041 |
## 136 0.048 0.008 |
## 137 0.667 0.098 |
## 138 0.001 0.000 |
## 139 0.134 0.053 |
## 140 0.134 0.053 |
## 141 0.001 0.000 |
## 142 2.403 0.238 |
## 143 0.285 0.100 |
## 144 0.000 0.000 |
## 145 0.122 0.034 |
## 146 0.480 0.188 |
## 147 0.249 0.080 |
## 148 0.514 0.047 |
## 149 0.460 0.046 |
## 150 0.560 0.101 |
## 151 1.227 0.110 |
## 152 0.079 0.033 |
## 153 1.054 0.084 |
## 154 0.132 0.035 |
## 155 0.258 0.079 |
## 156 1.546 0.132 |
## 157 0.014 0.002 |
## 158 0.480 0.188 |
## 159 0.980 0.181 |
## 160 0.021 0.005 |
## 161 0.096 0.016 |
## 162 0.469 0.140 |
## 163 0.226 0.053 |
## 164 0.642 0.200 |
## 165 0.002 0.000 |
## 166 0.079 0.033 |
## 167 0.194 0.049 |
## 168 0.074 0.008 |
## 169 0.186 0.068 |
## 170 0.013 0.004 |
## 171 0.034 0.005 |
## 172 0.009 0.002 |
## 173 0.511 0.097 |
## 174 0.259 0.066 |
## 175 0.017 0.005 |
## 176 0.092 0.026 |
## 177 0.011 0.003 |
## 178 0.092 0.034 |
## 179 0.040 0.010 |
## 180 0.253 0.051 |
## 181 0.422 0.049 |
## 182 0.446 0.067 |
## 183 1.497 0.286 |
## 184 0.164 0.052 |
## 185 0.102 0.026 |
## 186 0.004 0.001 |
## 187 0.077 0.019 |
## 188 0.013 0.003 |
## 189 0.138 0.021 |
## 190 0.328 0.032 |
## 191 0.338 0.044 |
## 192 0.624 0.156 |
## 193 0.056 0.005 |
## 194 2.044 0.216 |
## 195 0.117 0.015 |
## 196 0.221 0.048 |
## 197 3.218 0.223 |
## 198 0.421 0.067 |
## 199 0.078 0.011 |
## 200 0.077 0.019 |
## 201 0.083 0.011 |
## 202 0.119 0.017 |
## 203 0.460 0.066 |
## 204 0.003 0.000 |
## 205 0.029 0.005 |
## 206 0.639 0.116 |
## 207 0.144 0.023 |
## 208 0.815 0.096 |
## 209 0.398 0.074 |
## 210 0.027 0.008 |
## 211 0.011 0.001 |
## 212 0.012 0.001 |
## 213 0.146 0.045 |
## 214 0.000 0.000 |
## 215 0.014 0.002 |
## 216 0.108 0.026 |
## 217 0.047 0.008 |
## 218 0.300 0.093 |
## 219 0.143 0.044 |
## 220 0.095 0.010 |
## 221 0.234 0.035 |
## 222 0.089 0.011 |
## 223 0.434 0.087 |
## 224 0.019 0.003 |
## 225 1.430 0.261 |
## 226 0.032 0.008 |
## 227 0.002 0.000 |
## 228 0.040 0.015 |
## 229 0.528 0.056 |
## 230 0.091 0.032 |
## 231 0.930 0.278 |
## 232 0.544 0.115 |
## 233 0.097 0.028 |
## 234 0.641 0.136 |
## 235 0.078 0.020 |
## 236 0.000 0.000 |
## 237 0.014 0.002 |
## 238 0.099 0.021 |
## 239 0.006 0.001 |
## 240 0.010 0.002 |
## 241 0.174 0.037 |
## 242 0.758 0.194 |
## 243 0.068 0.010 |
## 244 0.073 0.015 |
## 245 0.000 0.000 |
## 246 0.519 0.133 |
## 247 0.101 0.028 |
## 248 0.173 0.021 |
## 249 0.001 0.000 |
## 250 0.097 0.031 |
## 251 0.264 0.057 |
## 252 0.283 0.047 |
## 253 0.163 0.042 |
## 254 0.022 0.008 |
## 255 0.392 0.072 |
## 256 0.498 0.133 |
## 257 0.511 0.150 |
## 258 0.018 0.002 |
## 259 0.094 0.018 |
## 260 0.011 0.005 |
## 261 0.230 0.032 |
## 262 0.089 0.020 |
## 263 1.805 0.223 |
## 264 0.676 0.139 |
## 265 0.242 0.042 |
## 266 0.545 0.121 |
## 267 0.646 0.096 |
## 268 0.029 0.004 |
## 269 0.073 0.015 |
## 270 0.482 0.081 |
## 271 0.073 0.014 |
## 272 0.110 0.018 |
## 273 0.005 0.000 |
## 274 0.167 0.017 |
## 275 0.514 0.192 |
## 276 0.397 0.113 |
## 277 0.001 0.000 |
## 278 0.006 0.001 |
## 279 0.139 0.028 |
## 280 0.074 0.019 |
## 281 0.045 0.012 |
## 282 0.353 0.104 |
## 283 0.005 0.001 |
## 284 0.000 0.000 |
## 285 0.015 0.007 |
## 286 0.039 0.011 |
## 287 0.100 0.013 |
## 288 0.167 0.035 |
## 289 0.001 0.000 |
## 290 0.174 0.019 |
## 291 0.011 0.004 |
## 292 0.254 0.021 |
## 293 0.000 0.000 |
## 294 0.137 0.016 |
## 295 0.083 0.014 |
## 296 0.238 0.056 |
## 297 0.079 0.009 |
## 298 0.492 0.167 |
## 299 0.002 0.000 |
## 300 0.980 0.162 |
##
## Categories
## Dim.1 ctr cos2 v.test Dim.2 ctr
## breakfast | 0.166 0.495 0.025 2.756 | -0.166 0.607
## Not.breakfast | -0.153 0.457 0.025 -2.756 | 0.154 0.560
## Not.tea time | -0.498 4.053 0.192 -7.578 | 0.093 0.174
## tea time | 0.386 3.142 0.192 7.578 | -0.072 0.135
## evening | 0.319 1.307 0.053 3.985 | -0.058 0.053
## Not.evening | -0.167 0.683 0.053 -3.985 | 0.030 0.028
## lunch | 0.659 2.385 0.075 4.722 | -0.390 1.018
## Not.lunch | -0.113 0.410 0.075 -4.722 | 0.067 0.175
## dinner | -0.661 1.146 0.033 -3.136 | 0.796 2.025
## Not.dinner | 0.050 0.086 0.033 3.136 | -0.060 0.152
## always | 0.293 1.102 0.045 3.660 | 0.041 0.027
## Not.always | -0.153 0.576 0.045 -3.660 | -0.022 0.014
## home | 0.013 0.006 0.005 1.260 | 0.002 0.000
## Not.home | -0.414 0.193 0.005 -1.260 | -0.062 0.005
## Not.work | -0.213 1.212 0.112 -5.775 | 0.133 0.575
## work | 0.523 2.967 0.112 5.775 | -0.326 1.407
## Not.tearoom | -0.299 2.694 0.372 -10.545 | -0.072 0.191
## tearoom | 1.246 11.240 0.372 10.545 | 0.300 0.796
## friends | 0.359 3.159 0.243 8.527 | -0.090 0.242
## Not.friends | -0.677 5.953 0.243 -8.527 | 0.170 0.456
## Not.resto | -0.285 2.234 0.226 -8.229 | 0.145 0.708
## resto | 0.796 6.251 0.226 8.229 | -0.406 1.981
## Not.pub | -0.198 1.158 0.147 -6.635 | 0.043 0.066
## pub | 0.744 4.358 0.147 6.635 | -0.161 0.248
## black | 0.108 0.108 0.004 1.071 | 0.383 1.657
## Earl Grey | 0.104 0.261 0.020 2.415 | -0.285 2.390
## green | -0.851 2.985 0.090 -5.173 | 0.808 3.282
## alone | -0.155 0.584 0.045 -3.648 | 0.069 0.140
## lemon | 0.478 0.941 0.028 2.905 | 0.306 0.472
## milk | 0.023 0.004 0.000 0.209 | -0.375 1.353
## other | 1.438 2.324 0.064 4.373 | 0.016 0.000
## No.sugar | 0.120 0.277 0.015 2.137 | 0.170 0.680
## sugar | -0.128 0.296 0.015 -2.137 | -0.181 0.727
## tea bag | -0.451 4.317 0.266 -8.916 | -0.417 4.494
## tea bag+unpackaged | 0.760 6.787 0.264 8.881 | 0.042 0.025
## unpackaged | 0.144 0.093 0.003 0.920 | 1.857 18.923
## chain store | -0.428 4.383 0.325 -9.857 | -0.396 4.592
## chain store+tea shop | 1.075 11.263 0.406 11.021 | 0.096 0.110
## tea shop | -0.060 0.013 0.000 -0.343 | 2.286 23.888
## p_branded | -0.503 3.002 0.117 -5.921 | -0.412 2.455
## p_cheap | -0.683 0.408 0.011 -1.826 | -0.268 0.076
## p_private label | -0.531 0.739 0.021 -2.518 | -0.478 0.732
## p_unknown | -0.323 0.157 0.004 -1.142 | -0.754 1.039
## p_upscale | 0.225 0.336 0.011 1.805 | 1.594 20.511
## p_variable | 0.497 3.454 0.147 6.632 | -0.218 0.808
## cos2 v.test Dim.3 ctr cos2 v.test
## breakfast 0.026 -2.764 | -0.483 6.900 0.215 -8.017 |
## Not.breakfast 0.026 2.764 | 0.445 6.369 0.215 8.017 |
## Not.tea time 0.007 1.423 | 0.265 1.886 0.054 4.027 |
## tea time 0.007 -1.423 | -0.205 1.462 0.054 -4.027 |
## evening 0.002 -0.728 | 0.451 4.312 0.106 5.640 |
## Not.evening 0.002 0.728 | -0.236 2.254 0.106 -5.640 |
## lunch 0.026 -2.793 | 0.301 0.822 0.016 2.160 |
## Not.lunch 0.026 2.793 | -0.052 0.141 0.016 -2.160 |
## dinner 0.048 3.774 | 0.535 1.235 0.022 2.537 |
## Not.dinner 0.048 -3.774 | -0.040 0.093 0.022 -2.537 |
## always 0.001 0.518 | 0.440 4.107 0.101 5.504 |
## Not.always 0.001 -0.518 | -0.230 2.147 0.101 -5.504 |
## home 0.000 0.189 | -0.064 0.249 0.134 -6.339 |
## Not.home 0.000 -0.189 | 2.085 8.047 0.134 6.339 |
## Not.work 0.043 3.600 | -0.045 0.089 0.005 -1.219 |
## work 0.043 -3.600 | 0.110 0.218 0.005 1.219 |
## Not.tearoom 0.022 -2.541 | 0.044 0.095 0.008 1.541 |
## tearoom 0.022 2.541 | -0.182 0.395 0.008 -1.541 |
## friends 0.015 -2.137 | 0.234 2.203 0.103 5.548 |
## Not.friends 0.015 2.137 | -0.440 4.151 0.103 -5.548 |
## Not.resto 0.059 4.194 | -0.027 0.032 0.002 -0.773 |
## resto 0.059 -4.194 | 0.075 0.091 0.002 0.773 |
## Not.pub 0.007 1.432 | -0.107 0.559 0.043 -3.590 |
## pub 0.007 -1.432 | 0.403 2.102 0.043 3.590 |
## black 0.048 3.793 | -0.977 14.546 0.313 -9.672 |
## Earl Grey 0.147 -6.621 | 0.386 5.905 0.268 8.955 |
## green 0.081 4.911 | -0.063 0.027 0.000 -0.386 |
## alone 0.009 1.620 | 0.142 0.809 0.037 3.346 |
## lemon 0.012 1.862 | 0.672 3.070 0.056 4.088 |
## milk 0.037 -3.347 | -0.589 4.500 0.092 -5.253 |
## other 0.000 0.047 | -1.417 3.719 0.062 -4.310 |
## No.sugar 0.031 3.034 | -0.444 6.273 0.210 -7.929 |
## sugar 0.031 -3.034 | 0.474 6.705 0.210 7.929 |
## tea bag 0.227 -8.237 | -0.010 0.003 0.000 -0.193 |
## tea bag+unpackaged 0.001 0.491 | -0.107 0.221 0.005 -1.248 |
## unpackaged 0.470 11.860 | 0.325 0.783 0.014 2.076 |
## chain store 0.279 -9.135 | 0.007 0.002 0.000 0.160 |
## chain store+tea shop 0.003 0.984 | -0.223 0.799 0.017 -2.286 |
## tea shop 0.581 13.177 | 0.536 1.771 0.032 3.087 |
## p_branded 0.079 -4.848 | 0.089 0.153 0.004 1.042 |
## p_cheap 0.002 -0.715 | -0.518 0.387 0.006 -1.385 |
## p_private label 0.017 -2.269 | -0.266 0.305 0.005 -1.261 |
## p_unknown 0.024 -2.661 | 0.129 0.041 0.001 0.454 |
## p_upscale 0.545 12.766 | -0.040 0.017 0.000 -0.319 |
## p_variable 0.028 -2.905 | 0.012 0.003 0.000 0.163 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.025 0.026 0.215 |
## tea.time | 0.192 0.007 0.054 |
## evening | 0.053 0.002 0.106 |
## lunch | 0.075 0.026 0.016 |
## dinner | 0.033 0.048 0.022 |
## always | 0.045 0.001 0.101 |
## home | 0.005 0.000 0.134 |
## work | 0.112 0.043 0.005 |
## tearoom | 0.372 0.022 0.008 |
## friends | 0.243 0.015 0.103 |
## resto | 0.226 0.059 0.002 |
## pub | 0.147 0.007 0.043 |
## Tea | 0.090 0.160 0.332 |
## How | 0.103 0.043 0.196 |
## sugar | 0.015 0.031 0.210 |
## how | 0.299 0.513 0.016 |
## where | 0.418 0.626 0.042 |
## price | 0.216 0.561 0.015 |
##
## Supplementary categories
## Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3
## F | 0.151 0.033 3.158 | -0.109 0.017 -2.278 | -0.048
## M | -0.221 0.033 -3.158 | 0.159 0.017 2.278 | 0.070
## employee | -0.153 0.006 -1.313 | -0.151 0.006 -1.289 | 0.103
## middle | -0.030 0.000 -0.205 | 0.336 0.017 2.281 | -0.284
## non-worker | -0.036 0.000 -0.324 | 0.185 0.009 1.666 | -0.291
## other worker | 0.040 0.000 0.187 | 0.013 0.000 0.061 | -0.063
## senior | 0.415 0.023 2.608 | 0.072 0.001 0.452 | -0.187
## student | 0.032 0.000 0.305 | -0.317 0.031 -3.022 | 0.394
## workman | -0.417 0.007 -1.473 | 0.249 0.003 0.878 | 0.343
## Not.sportsman | -0.030 0.001 -0.426 | 0.018 0.000 0.260 | -0.051
## sportsman | 0.020 0.001 0.426 | -0.012 0.000 -0.260 | 0.034
## 15-24 | -0.052 0.001 -0.601 | -0.393 0.068 -4.518 | 0.304
## 25-34 | -0.014 0.000 -0.132 | 0.107 0.003 1.008 | 0.410
## 35-44 | 0.193 0.006 1.312 | 0.120 0.002 0.814 | -0.343
## 45-59 | -0.081 0.002 -0.711 | 0.141 0.005 1.235 | -0.365
## +60 | 0.079 0.001 0.519 | 0.404 0.024 2.662 | -0.532
## 1/day | -0.318 0.047 -3.746 | -0.106 0.005 -1.248 | -0.130
## 1 to 2/week | -0.328 0.019 -2.354 | 0.104 0.002 0.742 | 0.561
## +2/day | 0.328 0.079 4.863 | 0.024 0.000 0.353 | -0.144
## 3 to 6/week | 0.088 0.001 0.545 | 0.073 0.001 0.454 | 0.174
## escape-exoticism | 0.000 0.000 0.001 | -0.091 0.007 -1.495 | 0.006
## Not.escape-exoticism | 0.000 0.000 -0.001 | 0.082 0.007 1.495 | -0.005
## Not.spirituality | -0.047 0.005 -1.199 | 0.009 0.000 0.233 | -0.086
## spirituality | 0.103 0.005 1.199 | -0.020 0.000 -0.233 | 0.188
## healthy | -0.007 0.000 -0.177 | -0.005 0.000 -0.124 | -0.059
## Not.healthy | 0.016 0.000 0.177 | 0.011 0.000 0.124 | 0.138
## diuretic | 0.053 0.004 1.068 | 0.017 0.000 0.350 | -0.097
## Not.diuretic | -0.073 0.004 -1.068 | -0.024 0.000 -0.350 | 0.134
## friendliness | 0.131 0.071 4.618 | -0.014 0.001 -0.508 | 0.055
## Not.friendliness | -0.546 0.071 -4.618 | 0.060 0.001 0.508 | -0.229
## iron absorption | 0.077 0.001 0.449 | 0.113 0.001 0.666 | 0.000
## Not.iron absorption | -0.009 0.001 -0.449 | -0.013 0.001 -0.666 | 0.000
## feminine | 0.048 0.002 0.725 | -0.029 0.001 -0.439 | 0.038
## Not.feminine | -0.036 0.002 -0.725 | 0.022 0.001 0.439 | -0.029
## Not.sophisticated | -0.059 0.001 -0.639 | -0.241 0.023 -2.622 | -0.093
## sophisticated | 0.023 0.001 0.639 | 0.095 0.023 2.622 | 0.037
## No.slimming | 0.026 0.004 1.061 | 0.027 0.004 1.114 | 0.021
## slimming | -0.146 0.004 -1.061 | -0.153 0.004 -1.114 | -0.116
## exciting | -0.077 0.004 -1.064 | -0.053 0.002 -0.729 | 0.196
## No.exciting | 0.049 0.004 1.064 | 0.033 0.002 0.729 | -0.124
## No.relaxing | -0.046 0.001 -0.615 | 0.078 0.004 1.051 | -0.056
## relaxing | 0.028 0.001 0.615 | -0.047 0.004 -1.051 | 0.034
## effect on health | -0.021 0.000 -0.196 | 0.155 0.007 1.428 | -0.065
## No.effect on health | 0.006 0.000 0.196 | -0.044 0.007 -1.428 | 0.018
## cos2 v.test
## F 0.003 -0.998 |
## M 0.003 0.998 |
## employee 0.003 0.884 |
## middle 0.012 -1.928 |
## non-worker 0.023 -2.620 |
## other worker 0.000 -0.289 |
## senior 0.005 -1.177 |
## student 0.047 3.760 |
## workman 0.005 1.209 |
## Not.sportsman 0.002 -0.721 |
## sportsman 0.002 0.721 |
## 15-24 0.041 3.494 |
## 25-34 0.050 3.873 |
## 35-44 0.018 -2.329 |
## 45-59 0.034 -3.190 |
## +60 0.041 -3.505 |
## 1/day 0.008 -1.530 |
## 1 to 2/week 0.054 4.020 |
## +2/day 0.015 -2.126 |
## 3 to 6/week 0.004 1.074 |
## escape-exoticism 0.000 0.092 |
## Not.escape-exoticism 0.000 -0.092 |
## Not.spirituality 0.016 -2.191 |
## spirituality 0.016 2.191 |
## healthy 0.008 -1.558 |
## Not.healthy 0.008 1.558 |
## diuretic 0.013 -1.973 |
## Not.diuretic 0.013 1.973 |
## friendliness 0.013 1.940 |
## Not.friendliness 0.013 -1.940 |
## iron absorption 0.000 -0.001 |
## Not.iron absorption 0.000 0.001 |
## feminine 0.001 0.574 |
## Not.feminine 0.001 -0.574 |
## Not.sophisticated 0.003 -1.014 |
## sophisticated 0.003 1.014 |
## No.slimming 0.002 0.845 |
## slimming 0.002 -0.845 |
## exciting 0.024 2.693 |
## No.exciting 0.024 -2.693 |
## No.relaxing 0.002 -0.746 |
## relaxing 0.002 0.746 |
## effect on health 0.001 -0.594 |
## No.effect on health 0.001 0.594 |
##
## Supplementary categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.033 0.017 0.003 |
## SPC | 0.032 0.053 0.076 |
## Sport | 0.001 0.000 0.002 |
## age_Q | 0.008 0.077 0.146 |
## frequency | 0.094 0.006 0.064 |
## escape.exoticism | 0.000 0.007 0.000 |
## spirituality | 0.005 0.000 0.016 |
## healthy | 0.000 0.000 0.008 |
## diuretic | 0.004 0.000 0.013 |
## friendliness | 0.071 0.001 0.013 |
## iron.absorption | 0.001 0.001 0.000 |
## feminine | 0.002 0.001 0.001 |
## sophisticated | 0.001 0.023 0.003 |
## slimming | 0.004 0.004 0.002 |
## exciting | 0.004 0.002 0.024 |
## relaxing | 0.001 0.004 0.002 |
## effect.on.health | 0.000 0.007 0.001 |
##
## Supplementary continuous variable
## Dim.1 Dim.2 Dim.3
## age | 0.042 | 0.204 | -0.340 |
The first lines of the summary output show how the MCA was set up. In this case the tea-drinking variables (columns 1-18) are the active variables, while the remaining variables are included as supplementary information. Next come the eigenvalues for the 18 dimensions; the first principal dimension (Dim 1) explains 9.9% of the variance in the entire dataset. Then the results for the individuals are listed: coordinate values, contributions to the axes, and squared cosines (quality of representation). The individual contributions to the axes are small, but it has to be taken into account that all 300 individuals were included in the analysis. All of these results are shown separately for the first three dimensions. The description of the categories follows in the same format, with an additional v.test column that indicates which categories have large negative or positive coordinates on the corresponding dimension. The largest contributions to dimension 1 come from the tearoom and chain store+tea shop categories. Further down, the eta2 table shows the largest value for the variable where, i.e. the strongest association with dimension 1. Next comes information about the supplementary variables (no contributions, since they were set as supplementary in this analysis). Finally, the coordinates of the only quantitative variable, age, are given.
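The eta2 values in the table above are correlation ratios between each categorical variable and the individuals' dimension scores. A minimal base-R sketch of that computation, on synthetic data rather than the tea dataset (the variable and score names are illustrative only):

```r
# Correlation ratio (eta^2) between a factor and a numeric score:
# between-group sum of squares divided by total sum of squares.
eta2 <- function(score, fac) {
  grand <- mean(score)
  between <- sum(tapply(score, fac, function(x) length(x) * (mean(x) - grand)^2))
  total <- sum((score - grand)^2)
  between / total
}

set.seed(1)
fac <- factor(rep(c("a", "b", "c"), each = 20))
score <- rnorm(60) + as.numeric(fac)  # scores that differ by group
round(eta2(score, fac), 3)
```

A value near 1 means the variable's categories separate cleanly along that dimension; a value near 0 means no association.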
#Variable biplot of the analysis
#Individuals shown in the background in grey
fviz_mca_biplot(teaMCA, label = "var", col.ind = "lightgray",
repel = TRUE,
labelsize = 3)
## Warning: ggrepel: 11 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 37 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Since a graph with all of the variable categories is too busy, I plotted the individual data points in the background in grey. No clear pattern or grouping emerges among the individual data points. The first two dimensions (components) represent 18% of the data. The first component opposes tearoom, resto, lunch and pub to Not.friends, Not.home and Not.tea time, i.e. it separates individuals who drink tea in many places from those who do not drink tea regularly. The second component distinguishes tea shop, unpackaged and p_upscale from the rest of the categories, which intuitively makes sense.
#All necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(ez)
library(ggpubr)
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:cowplot':
##
## get_legend
library(broom)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(cowplot)
library(lattice)
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
library(gridExtra)
First I will use the summary measure approach to study longitudinal data. For this I will use data from a rat nutrition study, which examines how rats (each with a unique ID) on one of three nutritionally different diets (Group) gained weight (value) over the 9-week study period (Time, in days). There were 16 rats in the study.
The data set used can be found here: https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt Study where the data originated: Crowder and Hand (1990).
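The summary measure idea itself can be sketched in base R: each subject's longitudinal series is collapsed into a single number, here the mean over all time points (synthetic toy values, not the actual rat weights):

```r
# Summary measure approach: one number per subject instead of a full series.
set.seed(42)
toy <- data.frame(
  id    = rep(1:4, each = 3),
  group = rep(c("A", "B"), each = 6),
  value = rnorm(12, mean = rep(c(250, 450), each = 6), sd = 10)
)
# Mean over the repeated measurements of each subject
per_subject <- aggregate(value ~ id + group, data = toy, FUN = mean)
per_subject
```

The per-subject summaries can then be compared between groups with standard cross-sectional methods such as ANOVA.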
#Read in the prepared data
RATSl <- read.csv("C:/Users/Localadmin_neimane/Desktop/Data Science/OpenDataScience/IODS-project/data/RATSl.txt", sep = ",")
#Set categorical variables as factors
RATSl$ID <- factor(RATSl$ID)
RATSl$Group <- factor(RATSl$Group)
#Set time as numeric
RATSl$Time <- as.numeric(RATSl$Time)
#Look at data structure
str(RATSl)
## 'data.frame': 176 obs. of 4 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ value: int 240 250 255 260 262 258 266 266 265 272 ...
## $ Time : num 1 8 15 22 29 36 43 44 50 57 ...
Here you can see each individual rat's weight change over the study period, with lines colored by diet group. Rats in group 1 weigh much less and gained less weight over the study. One rat in group 2 is much heavier than the others in its group. All rats appear to gain weight over the study period.
ggplot(RATSl, aes(x = Time, y = value, group = ID)) +
geom_line(aes(color = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
theme_bw() + #complete theme first, so it does not reset the legend position
theme(legend.position = "top") +
scale_colour_brewer(palette = "Dark2")
Now let's look at the mean response profiles. Here I calculated the mean value of all individuals in each group at each time point and plotted it.
#Calculate the mean and standard error for each group at all time points
#(the standard error divides by the number of rats in each group, via n())
RATSls <- RATSl %>%
group_by(Group, Time) %>%
summarise(mean = mean(value), se = sd(value)/sqrt(n())) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Plot the mean profiles
ggplot(RATSls, aes(x = Time, y = mean, linetype = Group, shape = Group, color = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size = 3) +
scale_shape_manual(values = c(1,2,5)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype = 1, width = 0.3) +
scale_y_continuous(name = "mean(weight) +/- se(weight)") +
theme_bw() + #complete theme first, so it does not reset the legend position
theme(legend.position = c(0.9, 0.5)) +
scale_colour_brewer(palette = "Dark2")
Visually it seems that diet 1 is the least fattening (I don't know what the optimal weight would be). On average, group 3 was the heaviest at the end of the study. Since there are already very large differences between the groups at Time 1, I suspect that either the rats were not randomly assigned to the diet treatments or had already been on the treatments for some time.
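One common remedy for such baseline imbalance is to use the first measurement as a covariate (ANCOVA), so that group differences are assessed after adjusting for starting weight. A minimal sketch on synthetic data (the values and variable names are illustrative, not the actual rat weights):

```r
# Synthetic data: three groups with very different baselines but similar growth
set.seed(7)
d <- data.frame(
  baseline = rnorm(30, mean = rep(c(250, 450, 500), each = 10), sd = 15),
  group    = factor(rep(1:3, each = 10))
)
d$endpoint <- d$baseline + rnorm(30, mean = 20, sd = 5)

# Group effect is tested after adjusting for the baseline covariate
fit <- aov(endpoint ~ baseline + group, data = d)
summary(fit)
```

This design choice separates "the groups started at different weights" from "the diets caused different weight gain".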
Before we continue let’s check the data.
Starting with checking outliers with boxplots.
ggplot(RATSls, aes(x = Group, y = mean)) +
geom_boxplot() +
theme_bw() +
stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
scale_y_continuous(name = "mean weight")
No outliers in the mean values.
Check normality:
ggqqplot(RATSls, "mean", ggtheme = theme_bw()) +
facet_grid(~ Group, labeller = "label_both")
Points fall on the line very well.
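The visual check can be backed up with a formal test. As a sketch (again assuming the RATSls object from above), a Shapiro-Wilk test per group:

```r
# Shapiro-Wilk normality test of the group means, one p-value per diet group
RATSls %>%
  group_by(Group) %>%
  summarise(shapiro_p = shapiro.test(mean)$p.value)
```

p-values above 0.05 would support the normality assumption.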
We saw a visual difference in weight between the groups. Now let's fit a linear model and test the result.
model<- aov(mean ~ Group, data = RATSls)
# Compute the analysis of variance table for the fitted model with anova()
anova(model)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 437104 218552 1006.3 < 2.2e-16 ***
## Residuals 30 6516 217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test confirms a significant difference across the nutritional diet groups. Now let's have a look at the diagnostic plots of the model to check the validity of the results.
par(mfrow = c(2, 2))
plot(model)
The horizontal line in the Residuals vs Fitted plot indicates a good linear relationship. In the normal Q-Q plot the points follow the line, with perhaps slight overdispersion on the right side; I still consider this fine and assume the residuals are normally distributed. The Scale-Location plot is not ideal, but it has to be taken into account that the dataset is small. Residuals vs Leverage shows no influential points; the Cook's distance contours do not even appear on the graph. As it seems we can trust the model results, let's perform a pairwise comparison to see whether the model indicates significant differences between the groups.
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mean ~ Group, data = RATSls)
##
## $Group
## diff lwr upr p adj
## 2-1 220.98864 205.49700 236.48027 0e+00
## 3-1 262.07955 246.58791 277.57118 0e+00
## 3-2 41.09091 25.59927 56.58254 9e-07
The pairwise comparison shows a significant difference between the mean weights of all three study groups. Rats on diet 3 weigh the most, while rats on diet 1 are the lightest.
In summary, the mean-value approach is limited, since we only look at the group mean values. Next we will look at a more complex method for analysing longitudinal data.
The data set used to explore longitudinal data analysis with the linear mixed effects model approach consists of brief psychiatric rating scale (BPRS) results (value) of 40 men (subject) who received treatment 1 or treatment 2 (treatment) over a study period of 8 weeks (week). The BPRS is used to evaluate whether a person may have schizophrenia; its questions assess symptoms such as hostility and mistrust. The more points, the more severe the symptoms.
The original data set can be found here: https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt The data originate from the study by Davis (2002).
#Read in the prepared data
BPRSl <- read.csv("C:/Users/Localadmin_neimane/Desktop/Data Science/OpenDataScience/IODS-project/data/BPRSl.txt", sep="\t")
##Set categorical variables as factors
BPRSl$treatment <- factor(BPRSl$treatment)
BPRSl$subject <- factor(BPRSl$subject)
BPRSl$week <- as.numeric(BPRSl$week)
BPRSl$value <- as.numeric(BPRSl$value)
#Look at data structure
str(BPRSl)
## 'data.frame': 360 obs. of 4 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ value : num 42 36 36 43 41 40 38 47 51 58 ...
## $ week : num 0 1 2 3 4 5 6 7 8 0 ...
Now let’s continue exploring the data by plotting the results of the study.
ggplot(BPRSl, aes(x = week, y = value, fill=treatment)) +
geom_boxplot()+
theme_bw()+
scale_fill_brewer(palette = "Dark2")
In the first graph we can see boxplots that include the data from all participants in each treatment. We can see that the BPRS test result (value) decreases over the study period for both treatments. We can also see that there are a few outliers at several time points.
Let's continue exploring the data by plotting each participant's changes in BPRS test results over time separately.
#Plot the data
TrueValues <- ggplot(BPRSl, aes(x = week, y = value, linetype = subject, color=subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both)+
theme_bw() + theme(legend.position = "none")+
scale_y_continuous(name = "BPRS test score", limits = c(10, 100), breaks = seq(25, 100, by = 25))
plot(TrueValues)
It seems that all of the participants experienced fewer symptoms after the study treatments. However, this is a busy graph, and each individual had a different starting point (severity of symptoms). Let's check the baseline BPRS scores of the participants in both treatments.
#Read in the original data
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
#Rename week0
names(BPRS)[3]<-"baseline"
#Set treatment as factor
BPRS$treatment<-as.factor(BPRS$treatment)
#Plot the baseline value of the test
ggplot(BPRS, aes(x = treatment, y = baseline)) +
geom_boxplot()+
theme_bw()+
stat_summary(fun = mean, shape = 1, col = 'red', geom = 'point')
#Save data to use for linear modelling
BPRSAn <- BPRS
BPRSAn$subject <- as.factor(BPRSAn$subject)
BPRSAn <- pivot_longer(BPRSAn, week1:week8) #Make long table
BPRSAn$week <- sub("week","",BPRSAn$name) #Remove name week from the name
BPRSAn <- BPRSAn[ -c(4) ] #Remove column
BPRSAn$week<-as.numeric(BPRSAn$week)
The mean baseline BPRS score is similar for the participants in both treatments, but the distributions are not. More participants in treatment group 2 had a high BPRS value before the study than in treatment group 1.
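The same comparison can be made numerically. A sketch, assuming the BPRS data frame created above is still available:

```r
# Numeric summary of the baseline BPRS scores per treatment group
BPRS %>%
  group_by(treatment) %>%
  summarise(mean = mean(baseline), sd = sd(baseline), median = median(baseline))
```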
Next I will calculate the difference from the baseline points (week0 - before starting the treatment result in the BPRS test) over the study period for each participant separately.
#Calculate the difference between the test result at each point in the study and the result before the study
week_cols <- paste0("week", 1:8)
BPRS[week_cols] <- BPRS[week_cols] - BPRS$baseline
#Make long table
BPRS2l <- pivot_longer(BPRS, week1:week8)
BPRS2l$week <- sub("week","",BPRS2l$name)
BPRS2l$week<-as.numeric(BPRS2l$week)
BPRS2l$treatment<-as.factor(BPRS2l$treatment)
BPRS2l$subject<-as.factor(BPRS2l$subject)
#Plot the new data where the baseline test result is accounted for
ggplot(BPRS2l, aes(x = week, y = value, linetype = subject, color=subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme_bw() + theme(legend.position = "none")+
geom_hline(yintercept=0, color = "black", size=0.7, linetype="dashed")
This graph is easier to interpret, since it takes into account that the participants started the study with different severities of schizophrenia symptoms. If a participant's line is below the dashed line, that participant had lower symptom severity than before the treatment. The line stays above the baseline for two of the participants in treatment 1 and three participants in treatment 2.
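The end-of-study picture can also be checked programmatically. A sketch using the BPRS2l difference data from above (a positive value means the score is above baseline); note this looks only at the final week, whereas the counts in the text refer to lines that stay above baseline:

```r
# Per treatment, count subjects whose final-week score is above vs below baseline
BPRS2l %>%
  filter(week == 8) %>%
  group_by(treatment) %>%
  summarise(above_baseline = sum(value > 0), below_baseline = sum(value <= 0))
```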
Let's start by creating random intercept models, both including and excluding the baseline test value.
model1 <- lmer(value ~ treatment + week + (1 | subject), data = BPRSAn)
# Print the summary of the model
summary(model1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ treatment + week + (1 | subject)
## Data: BPRSAn
##
## REML criterion at convergence: 2410.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.5962 -0.6530 -0.1292 0.4677 3.5585
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 52.54 7.249
## Residual 96.97 9.847
## Number of obs: 320, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.7781 2.0981 21.819
## treatment2 0.3938 1.1009 0.358
## week -2.1354 0.2402 -8.888
##
## Correlation of Fixed Effects:
## (Intr) trtmn2
## treatment2 -0.262
## week -0.515 0.000
model1Base <- lmer(value ~ treatment + baseline + week + (1 | subject), data = BPRSAn)
summary(model1Base)
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ treatment + baseline + week + (1 | subject)
## Data: BPRSAn
##
## REML criterion at convergence: 2292.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.23588 -0.61723 -0.08101 0.49379 3.06831
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 36.13 6.011
## Residual 66.11 8.131
## Number of obs: 320, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 21.0735 2.6682 7.898
## treatment2 -0.6575 0.9131 -0.720
## baseline 0.5256 0.0431 12.197
## week -2.1354 0.1984 -10.765
##
## Correlation of Fixed Effects:
## (Intr) trtmn2 baseln
## treatment2 -0.098
## baseline -0.759 -0.094
## week -0.335 0.000 0.000
anova(model1, model1Base)
## refitting model(s) with ML (instead of REML)
## Data: BPRSAn
## Models:
## model1: value ~ treatment + week + (1 | subject)
## model1Base: value ~ treatment + baseline + week + (1 | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model1 5 2424.3 2443.1 -1207.1 2414.3
## model1Base 6 2302.8 2325.4 -1145.4 2290.8 123.44 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Inclusion of the baseline test value improves the model. Now let's fit a random intercept and random slope model.
model2 <- lmer(value ~ treatment + baseline + (week | subject), data = BPRSAn)
# Print the summary of the model
summary(model2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ treatment + baseline + (week | subject)
## Data: BPRSAn
##
## REML criterion at convergence: 2296.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0749 -0.6046 -0.0540 0.5694 3.7311
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 99.186 9.959
## week 5.999 2.449 -0.81
## Residual 57.770 7.601
## Number of obs: 320, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 13.49053 2.38249 5.662
## treatment2 -0.66799 0.85362 -0.783
## baseline 0.53087 0.04041 13.139
##
## Correlation of Fixed Effects:
## (Intr) trtmn2
## treatment2 -0.102
## baseline -0.797 -0.095
Let's continue by adding the interaction between treatment and week to the model.
model3 <- lmer(value ~ treatment * week + baseline + (week | subject), data = BPRSAn)
anova(model1, model1Base, model2, model3)
## refitting model(s) with ML (instead of REML)
## Data: BPRSAn
## Models:
## model1: value ~ treatment + week + (1 | subject)
## model1Base: value ~ treatment + baseline + week + (1 | subject)
## model2: value ~ treatment + baseline + (week | subject)
## model3: value ~ treatment * week + baseline + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model1 5 2424.3 2443.1 -1207.1 2414.3
## model1Base 6 2302.8 2325.4 -1145.4 2290.8 123.443 1 < 2.2e-16 ***
## model2 7 2309.6 2336.0 -1147.8 2295.6 0.000 1 1
## model3 9 2280.4 2314.3 -1131.2 2262.4 33.214 2 6.135e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model3 has the lowest AIC, BIC and deviance, and the highest logLik. Thus I will continue examining model3.
summary(model3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ treatment * week + baseline + (week | subject)
## Data: BPRSAn
##
## REML criterion at convergence: 2263.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0453 -0.6187 -0.0284 0.5746 3.7195
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 47.245 6.874
## week 1.574 1.254 -0.54
## Residual 55.884 7.476
## Number of obs: 320, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 23.46916 2.74890 8.538
## treatment2 -5.95830 1.84372 -3.232
## week -2.72321 0.38106 -7.146
## baseline 0.53094 0.03979 13.343
## treatment2:week 1.17560 0.36477 3.223
##
## Correlation of Fixed Effects:
## (Intr) trtmn2 week baseln
## treatment2 -0.305
## week -0.510 0.426
## baseline -0.680 -0.043 0.000
## tretmnt2:wk 0.299 -0.890 -0.479 0.000
We can see a substantial effect of the individual subject on the BPRS test score: the subject intercept variance (47.2) is large relative to the residual variance (55.9). The still-large residual variance indicates considerable within-subject variation. The fixed-effects part of the model shows that the longer the time under treatment (week), the lower the severity of schizophrenia symptoms as evaluated with the BPRS test. Also, a larger baseline test value results in a higher test score throughout the study. The categorical predictor treatment has a coefficient of -5.96 for treatment2, meaning that at the start of the study the expected test score of participants in treatment 2 is 5.96 points lower than in treatment group 1, though the positive treatment2:week interaction (1.18 per week) reduces this difference over time. The correlation coefficients of the fixed effects indicate that this model could have a multicollinearity problem; in that case the model is still valid for predictions, but the regression coefficients may be unreliable and it can be difficult to correctly separate the effects of the regressors on the result.
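Using the variance components printed above, we can compute a rough intra-class correlation, the share of total variance attributable to between-subject differences (ignoring the random slope). The numbers below are copied from the model3 summary:

```r
# Rough ICC from the model3 variance components (random slope ignored)
int_var <- 47.245   # subject (Intercept) variance from summary(model3)
res_var <- 55.884   # residual variance from summary(model3)
icc <- int_var / (int_var + res_var)
round(icc, 2)   # roughly 0.46
```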
Let’s see visually how the model3 fitted values resemble the original observed values.
Fitted <- fitted(model3)
# Create a new column Fitted in BPRSAn
BPRSAn <- BPRSAn %>%
mutate(Fitted)
# Draw the plot of the fitted values
Modelled <- ggplot(BPRSAn, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(linetype=subject, color=subject)) +
scale_x_continuous(name = "weeks", breaks = seq(1, 8, 1)) +
scale_y_continuous(name = "BPRS test score", limits = c(10, 100), breaks = seq(25, 100, by = 25))+
facet_grid(. ~ treatment, labeller = label_both)+
theme_bw()+theme(legend.position = "none")
#Model predictions and original value plots
plot_grid(TrueValues + theme(legend.position = "none") + labs(subtitle = "Values obtained in the study"),
Modelled + labs(subtitle = "Model fitted values"),
labels = NULL)
In the graph on the left you can see the true observed BPRS test scores, while on the right are the model3 fitted values. Just as in the original data, the spread of predicted values is bigger for treatment 2. We can see that for some individuals in treatment 1 the model underestimates the values at the beginning of the study. Still, it seems the model captured the differences in value spread between the two study treatments.
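The visual comparison can be complemented with simple numeric fit measures. A sketch, assuming BPRSAn now contains the Fitted column created above:

```r
# Correlation and RMSE between the observed and model3 fitted values
cor(BPRSAn$value, BPRSAn$Fitted)
sqrt(mean((BPRSAn$value - BPRSAn$Fitted)^2))
```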
Let’s see if the model chosen is valid.
#Linearity
plot(resid(model3),BPRSAn$value)
It is hard to tell whether the scatter of the residuals against the observed values is truly random.
#Homogeneity of Variance
plot(model3)
The model residuals seem evenly spread around the central line.
#Normal distribution of the residuals
qqmath(model3, id=0.05)
Most of the points seem fine; there are some deviations in the tails. Perhaps observations 4, 11 and 1 could be excluded. Let's see whether one of the most common transformations, the log, improves the approximate normality of the data.
# Log tranformation of the response variable
BPRSAn$valueLog <- log10(BPRSAn$value)
model3log <- lmer(valueLog ~ treatment * week + baseline + (week | subject), data = BPRSAn)
#Check how the log transformed model fits to normality assumptions
plot(resid(model3log),BPRSAn$valueLog)
plot(model3log)
qqmath(model3log, id=0.05)
The log transformation slightly improves the fit to normality, but I think the original model3 does not violate the normality assumptions too badly, so we can stick with its results. If we based our results on the log-transformed model, the interpretation would not be as straightforward.
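To illustrate that extra step: because log10 was used, fitted values from the log model would have to be back-transformed with 10^x before they could be compared with BPRS scores. A sketch:

```r
# Back-transform the log-model fitted values to the original BPRS scale
fitted_log_scale <- fitted(model3log)
fitted_bprs_scale <- 10^fitted_log_scale
head(fitted_bprs_scale)
```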